Abstract

A statistical model or a learning machine is called regular if the map taking a pa- 
rameter to a probability distribution is one-to-one and if its Fisher information matrix 
is always positive definite. If otherwise, it is called singular. In regular statistical mod- 
els, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal 
likelihood, can be asymptotically approximated by the Schwarz Bayes information 
criterion (BIC), whereas in singular models such approximation does not hold. 

Recently, it was proved that the Bayes free energy of a singular model is asymptot- 
ically given by a generalized formula using a birational invariant, the real log canonical 
threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values 
of RLCTs in several statistical models are now being discovered based on algebraic 
geometrical methodology. However, it has been difficult to estimate the Bayes free 
energy using only training samples, because an RLCT depends on an unknown true 
distribution. 

In the present paper, we define a widely applicable Bayesian information criterion
(WBIC) as the average of the log likelihood function over the posterior distribution with
the inverse temperature 1/log n, where n is the number of training samples. We
mathematically prove that WBIC has the same asymptotic expansion as the Bayes
free energy, even if a true distribution is singular for and unrealizable by a statistical
model. Since WBIC can be numerically calculated without any information about a
true distribution, it is a generalized version of BIC for singular statistical models.

Keywords. Bayes marginal likelihood, Widely applicable Bayes Information Criterion 



1 Introduction 

A statistical model or a learning machine is called regular if the map taking a parameter 
to a probability distribution is one-to-one and if its Fisher information matrix is always 
positive definite. If otherwise, it is called singular. Many statistical models and learning 
machines are not regular but singular, for example, artificial neural networks, normal 
mixtures, binomial mixtures, reduced rank regressions, Bayesian networks, and hidden 
Markov models. In general, if a statistical model contains hierarchical layers, hidden 
variables, or grammatical rules, then it is singular. In other words, if a statistical model is 
devised so that it extracts hidden structure from a random phenomenon, then it naturally 
becomes singular. If a statistical model is singular, then the likelihood function cannot be 
approximated by any normal distribution, so that none of AIC, BIC, or MDL can
be used in statistical model evaluation. Hence constructing singular learning theory is an 
important issue in both statistics and learning theory. 

A statistical model or a learning machine is represented by a probability density
function $p(x|w)$ of $x \in \mathbb{R}^N$ for a given parameter $w \in W \subset \mathbb{R}^d$, where $W$ is the set of all
parameters. A prior probability density function is denoted by $\varphi(w)$ on $W$. Assume that
training samples $X_1, X_2, ..., X_n$ are independently subject to a probability density function
$q(x)$, which is called a true distribution. The log loss function or the minus log likelihood
function is defined by

$$L_n(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p(X_i|w). \tag{1}$$



Also the Bayes free energy F is defined by 



$$\mathcal{F} = -\log \int \prod_{i=1}^{n} p(X_i|w)\,\varphi(w)\,dw. \tag{2}$$

This value $\mathcal{F}$ can be understood as the minus logarithm of the marginal likelihood of a model
and a prior, hence it plays an important role in statistical model evaluation. In fact,
a model or a prior is often optimized by maximization of the Bayes marginal likelihood
(Good, 1965), which is equivalent to minimization of the Bayes free energy.

If a statistical model is regular, then the posterior distribution can be asymptotically
approximated by a normal distribution, so that

$$\mathcal{F} = nL_n(\hat{w}) + \frac{d}{2}\log n, \tag{3}$$

where $\hat{w}$ is the maximum likelihood estimator, $d$ is the dimension of the parameter space,
and $n$ is the number of training samples. The right hand side of eq.(3) is the well-known
Schwarz Bayesian information criterion (BIC) (Schwarz, 1978).

If a statistical model is singular, then the posterior distribution is different from any
normal distribution, hence the Bayes free energy cannot be approximated by BIC in
general. Recently, it was proved in (Watanabe, 1999, 2001a, 2009, 2010b) that, even if a
statistical model is singular,

$$\mathcal{F} = nL_n(w_0) + \lambda \log n,$$

where $w_0$ is the parameter that minimizes the Kullback-Leibler distance from a true
distribution to a statistical model, and $\lambda > 0$ is a rational number called the real log canonical
threshold (RLCT).



The RLCT (Gelfand and Shilov, 1964) plays an important role in algebraic
analysis (Bernstein, 1972; Sato and Shintani, 1974; Kashiwara; Kollar, 1997; Saito, 2007).
In algebraic geometry, it represents
a relative property of singularities of a pair of algebraic varieties. In statistical learning
theory, it shows the asymptotic behaviors of the Bayes free energy and the generalization
loss, which are determined by a pair of an optimal parameter set and a parameter set $W$.

If a set of a true distribution, a statistical model, and a prior distribution is fixed, then
there is an algebraic geometrical procedure which enables us to find an RLCT (Hironaka,
1964). In fact, RLCTs for several statistical models and learning machines are being
discovered. For example, RLCTs have been studied in artificial neural networks (Watanabe,
2001b; Aoyagi and Nagata, 2012), normal mixtures (Yamazaki and Watanabe, 2003),
reduced rank regressions (Aoyagi and Watanabe, 2005), Bayes networks (Rusakov and Geiger,
2004; Zwiernik, 2010, 2011), binomial mixtures, Boltzmann machines (Yamazaki and Watanabe,
2005), and hidden Markov models. To study singular statistical models, new algebraic
geometrical theory is constructed (Watanabe, 2009; Drton et al., 2009; Lin, 2011; Kiraly et al.,
2012).



Based on such research, the theoretical behavior of the Bayes free energy has been clarified.
These results are very important because they indicate the quantitative difference of sin- 
gular models from regular ones. However, in general, an RLCT depends on an unknown 
true distribution. In practical applications, we do not know a true distribution, hence we 
cannot directly apply the theoretical results to statistical model evaluation. 

In the present paper, in order to estimate the Bayes free energy without any information 
about a true distribution, we propose a widely applicable Bayesian information criterion 
(WBIC) by the following definition. 



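$$\mathrm{WBIC} = \mathbb{E}_w^{\beta}[nL_n(w)], \qquad \beta = \frac{1}{\log n}, \tag{4}$$

where $\mathbb{E}_w^{\beta}[\ ]$ shows the expectation value over the posterior distribution on $W$ that
is defined by, for an arbitrary integrable function $G(w)$,

$$\mathbb{E}_w^{\beta}[G(w)] = \frac{\int G(w) \prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw}{\int \prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw}. \tag{5}$$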



In this definition, $\beta > 0$ is called the inverse temperature. Then the main purpose of this
paper is to show

$$\mathcal{F} \cong \mathrm{WBIC}.$$

To establish mathematical support of WBIC, we prove three theorems. Firstly, in Theorem
3 we show that there exists a unique inverse temperature $\beta^*$ which satisfies

$$\mathcal{F} = \mathbb{E}_w^{\beta^*}[nL_n(w)].$$

The optimal inverse temperature $\beta^*$ satisfies the convergence in probability, $\beta^* \log n \to 1$
as $n \to \infty$. Secondly, in Theorem 4 we prove that, even if a statistical model is singular,

$$\mathrm{WBIC} \cong nL_n(w_0) + \lambda \log n.$$

In other words, WBIC has the same asymptotic behavior as the Bayes free energy even
if a statistical model is singular. And lastly, in Theorem 5 we prove that, if a statistical
model is regular, then

$$\mathrm{WBIC} \cong nL_n(\hat{w}) + \frac{d}{2}\log n,$$

which shows WBIC coincides with BIC in regular statistical models. Moreover, the
computational cost of numerically calculating WBIC is far smaller than that of the Bayes
free energy. These results show that WBIC is a generalized version of BIC for singular
statistical models and that RLCTs can be estimated even if a true distribution is unknown.
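
The definition in eq.(4) suggests a simple numerical recipe: draw parameters from the posterior at inverse temperature 1/log n and average n L_n(w) over the draws. The following is a minimal sketch of that idea, not the procedure used in the paper. It assumes a toy model p(x|w) = N(x; w, 1), a uniform prior on [-5, 5], a fixed illustrative data set, and a plain random-walk Metropolis sampler; the function names `n_log_loss` and `wbic_metropolis` and all tuning constants are hypothetical choices.

```python
import math
import random


def n_log_loss(xs, w):
    """n * L_n(w) for the toy model p(x|w) = N(x; w, 1)."""
    n = len(xs)
    return 0.5 * sum((x - w) ** 2 for x in xs) + 0.5 * n * math.log(2.0 * math.pi)


def wbic_metropolis(xs, lo=-5.0, hi=5.0, steps=20000, burn=2000, scale=0.5, seed=0):
    """Average n*L_n(w) over a Metropolis chain that targets the posterior
    at inverse temperature beta = 1/log n (uniform prior on [lo, hi])."""
    rng = random.Random(seed)
    beta = 1.0 / math.log(len(xs))
    w = 0.5 * (lo + hi)
    cur = n_log_loss(xs, w)
    total = 0.0
    for t in range(burn + steps):
        prop = w + rng.gauss(0.0, scale)
        if lo <= prop <= hi:  # proposals outside the prior support are rejected
            new = n_log_loss(xs, prop)
            # Metropolis acceptance for the tempered density exp(-beta * n * L_n(w))
            if rng.random() < math.exp(min(0.0, -beta * (new - cur))):
                w, cur = prop, new
        if t >= burn:
            total += cur
    return total / steps


xs = [(i - 24.5) / 25.0 for i in range(50)]  # fixed illustrative data, mean 0
wbic = wbic_metropolis(xs)
w_hat = sum(xs) / len(xs)  # maximum likelihood estimator for this model
bic = n_log_loss(xs, w_hat) + 0.5 * math.log(len(xs))
print(wbic, bic)
```

Because this toy model is regular with d = 1, the estimate should land close to BIC up to Monte Carlo error; only one chain at a single temperature is needed, which is the source of the low computational cost.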
Variable                            Name                            Eq. number
$\mathcal{F}$                       Bayes free energy               eq.(2)
G                                   Generalization loss             eq.(36)
WBIC                                WBIC                            eq.(4)
WAIC                                WAIC                            eq.(37)
$\mathbb{E}_w^{\beta}[\ ]$          posterior average               eq.(5)
$\beta^*$                           optimal inverse temperature
$L(w)$                              log loss function               eq.(6)
$L_n(w)$                            empirical loss                  eq.(1)
$K(w)$                              Kullback-Leibler distance       eq.(8)
$K_n(w)$                            empirical KL distance           eq.(9)
$\lambda$                           real log canonical threshold    eq.(18)
$m$                                 multiplicity                    eq.(19)
$Q(K(w),\varphi(w))$                parity of model                 eq.(20)
$(\mathcal{M},g(u),a(x,u),b(u))$    resolution quartet              Theorem 1

Table 1: Variable, Name, and Equation Number



This paper consists of eight sections. In Section 2, we summarize several notations. In 
Section 3, singular learning theory and standard representation theorem are introduced. 
The main theorems and corollaries of this paper are explained in Section 4, which are 
mathematically proved in Section 5. As the purpose of the present paper is to establish
the mathematical support of WBIC, Sections 4 and 5 are the main sections. In Section
6, we illustrate how to use WBIC in statistical model evaluation using an
experimental result. In Sections 7 and 8, we discuss and conclude the present paper.



2 Statistical Models and Notations 

In this section, we summarize several notations. Table 1 shows variables, names, and
equation numbers in this paper. The average log loss function $L(w)$ and the entropy of
the true distribution $S$ are respectively defined by

$$L(w) = -\int q(x) \log p(x|w)\,dx, \tag{6}$$
$$S = -\int q(x) \log q(x)\,dx. \tag{7}$$

Then $L(w) = S + D(q\|p_w)$, where $D(q\|p_w)$ is the Kullback-Leibler distance defined by

$$D(q\|p_w) = \int q(x) \log \frac{q(x)}{p(x|w)}\,dx.$$

Then $D(q\|p_w) \ge 0$, hence $L(w) \ge S$. Moreover, $L(w) = S$ if and only if $p(x|w) = q(x)$.
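
The decomposition L(w) = S + D(q||p_w) can be checked numerically in a discrete analogue, where the integrals become finite sums. The sketch below is only an illustration: the two probability vectors `q` and `p` are arbitrary choices, and the helper names are hypothetical.

```python
import math


def avg_log_loss(q, p):
    """Discrete analogue of eq.(6): L = -sum_x q(x) log p(x)."""
    return -sum(qx * math.log(px) for qx, px in zip(q, p) if qx > 0)


def entropy(q):
    """Discrete analogue of eq.(7): S = -sum_x q(x) log q(x)."""
    return -sum(qx * math.log(qx) for qx in q if qx > 0)


def kl(q, p):
    """Kullback-Leibler distance D(q||p) = sum_x q(x) log(q(x)/p(x))."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


q = [0.5, 0.3, 0.2]  # illustrative "true" distribution
p = [0.4, 0.4, 0.2]  # illustrative model distribution
L, S, D = avg_log_loss(q, p), entropy(q), kl(q, p)
print(L, S + D, D)
```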

In this paper, we assume that there exists a parameter $w_0$ in the open kernel of $W$
which minimizes $L(w)$,

$$L(w_0) = \min_{w \in W} L(w),$$

where the open kernel of a set $S$ is defined as the largest open set that is contained
in $S$. Note that such $w_0$ is not unique in general, because the map $w \mapsto p(x|w)$ is not
one-to-one in general in singular statistical models. We also assume that, for an arbitrary
$w$ that satisfies $L(w) = L(w_0)$, $p(x|w)$ is the same probability density function. Let $p_0(x)$
be such a unique probability density function. In general, the set

$$W_0 = \{w \in W;\ p(x|w) = p_0(x)\}$$

is not a set of a single element but an analytic set or an algebraic set with singularities. Let
us define a log density ratio function,

$$f(x,w) = \log \frac{p_0(x)}{p(x|w)},$$

which is equivalent to

$$p(x|w) = p_0(x) \exp(-f(x,w)).$$
Two functions $K(w)$ and $K_n(w)$ are respectively defined by

$$K(w) = \int q(x) f(x,w)\,dx, \tag{8}$$
$$K_n(w) = \frac{1}{n}\sum_{i=1}^{n} f(X_i,w). \tag{9}$$

Then it immediately follows that

$$L(w) = L(w_0) + K(w), \tag{10}$$
$$L_n(w) = L_n(w_0) + K_n(w). \tag{11}$$

The expectation value over all sets of training samples $X_1, X_2, ..., X_n$ is denoted by $\mathbb{E}[\ ]$.
For example, $\mathbb{E}[L_n(w)] = L(w)$ and $\mathbb{E}[K_n(w)] = K(w)$. The problem of statistical learning
is characterized by the log density ratio function $f(x,w)$. In fact,

$$\mathbb{E}_w^{\beta}[nL_n(w)] = nL_n(w_0) + \mathbb{E}_w^{\beta}[nK_n(w)], \tag{12}$$
$$\mathbb{E}_w^{\beta}[nK_n(w)] = \frac{\int nK_n(w) \exp(-n\beta K_n(w))\,\varphi(w)\,dw}{\int \exp(-n\beta K_n(w))\,\varphi(w)\,dw}. \tag{13}$$

The main purpose of the present paper is to prove

$$\mathcal{F} \cong nL_n(w_0) + \mathbb{E}_w^{\beta}[nK_n(w)]$$

for $\beta = 1/\log n$.
Definition.

(1) If $q(x) = p_0(x)$, then $q(x)$ is said to be realizable by $p(x|w)$. If otherwise, it is said to
be unrealizable.

(2) If the set $W_0$ consists of a single element $w_0$ and if the Hessian matrix

$$J_{ij}(w) = \frac{\partial^2 L}{\partial w_i \partial w_j}(w) \tag{14}$$

at $w = w_0$ is strictly positive definite, $q(x)$ is said to be regular for $p(x|w)$. If otherwise,
then it is said to be singular for $p(x|w)$.

Note that the matrix $J(w)$ is equal to the Hessian matrix of $K(w)$ and that $J(w_0)$ is
equal to the Fisher information matrix if the true distribution is realizable by a statistical
model.
3 Singular Learning Theory



In this section we summarize singular learning theory. In the present paper, we assume 
the following conditions. 

Fundamental Conditions. 

(1) The set of parameters $W$ is a compact set in $\mathbb{R}^d$ whose open kernel is not the empty
set. Its boundary is defined by several analytic functions, in other words,

$$W = \{w \in \mathbb{R}^d;\ \pi_1(w) \ge 0, \pi_2(w) \ge 0, ..., \pi_k(w) \ge 0\}.$$

(2) The prior distribution satisfies $\varphi(w) = \varphi_1(w)\varphi_2(w)$, where $\varphi_1(w) \ge 0$ is an analytic
function and $\varphi_2(w) > 0$ is a $C^{\infty}$-class function.

(3) Let $s \ge 6$ and

$$L^s(q) = \Big\{f(x);\ \|f\|_s \equiv \Big(\int |f(x)|^s q(x)\,dx\Big)^{1/s} < \infty\Big\}$$

be a Banach space. There exists an open set $W' \supset W$ such that the map $W' \ni w \mapsto f(x,w)$
is an $L^s(q)$-valued analytic function.

(4) The set $W_\epsilon$ is defined by

$$W_\epsilon = \{w \in W;\ K(w) \le \epsilon\}.$$

It is assumed that there exist constants $\epsilon, c > 0$ such that

$$(\forall w \in W_\epsilon) \quad \mathbb{E}_X[f(X,w)] \ge c\,\mathbb{E}_X[f(X,w)^2]. \tag{15}$$

Remark. (1) These conditions allow that the set of optimal parameters

$$W_0 = \{w \in W;\ p(x|w) = p(x|w_0)\} = \{w \in W;\ K(w) = 0\}$$

may contain singularities, and that the Hessian matrix $J(w)$ at $w \in W_0$ is not positive
definite. Therefore $K(w)$ cannot be approximated by any quadratic form in general.
(2) The condition eq.(15) is satisfied if a true distribution is realizable by or regular for a
statistical model (Watanabe, 2010b). If a true distribution is unrealizable by and singular
for a statistical model, this condition is not satisfied in general. In the present paper, we
study the case when eq.(15) is satisfied.



Lemma 1. Assume Fundamental Conditions (1)-(4). Let

$$\beta = \frac{\beta_0}{\log n},$$

where $\beta_0 > 0$ is a constant and let $0 \le r < 1/2$. Then, as $n \to \infty$,

$$\int_{K(w) \ge 1/n^r} \exp(-n\beta K_n(w))\,\varphi(w)\,dw = o_p(\exp(-\sqrt{n})), \tag{16}$$
$$\int_{K(w) \ge 1/n^r} nK_n(w) \exp(-n\beta K_n(w))\,\varphi(w)\,dw = o_p(\exp(-\sqrt{n})). \tag{17}$$
The proof of Lemma 1 is given in Section 5.

Let $\epsilon > 0$ be a sufficiently small constant. Lemma 1 shows that integrals outside of
the region $W_\epsilon$ do not affect the expectation value $\mathbb{E}_w^{\beta}[nK_n(w)]$ asymptotically, because in
the following theorems, we prove that integrals in the region $W_\epsilon$ have larger orders than
them. To study integrals in the region $W_\epsilon$, we need an algebraic geometrical method, because
the set $\{w; K(w) = 0\}$ contains singularities in general. There are quite many kinds of
singularities; however, the following theorem brings any singularities into the same standard
form.

Theorem 1. (Standard Representation) Assume Fundamental Conditions (1)-(4). Let
$\epsilon > 0$ be a sufficiently small constant. Then there exists a quartet $(\mathcal{M}, g(u), a(x,u), b(u))$,
where

(1) $\mathcal{M}$ is a $d$ dimensional real analytic manifold,

(2) $g$ is a proper analytic function $g : \mathcal{M} \to W'_\epsilon$, where $W'_\epsilon$ is an open set which contains
$W_\epsilon$ and $g : \{u \in \mathcal{M}; K(g(u)) \ne 0\} \to \{w \in W'_\epsilon; K(w) \ne 0\}$ is a bijective map,

(3) $a(x,u)$ is an $L^s(q)$-valued analytic function,

(4) and $b(u)$ is an infinitely many times differentiable function which satisfies $b(u) > 0$,
such that the following equations are satisfied in each local coordinate of $\mathcal{M}$.

$$K(g(u)) = u^{2k},$$
$$f(x,g(u)) = u^{k} a(x,u),$$
$$\varphi(w)\,dw = \varphi(g(u))|g'(u)|\,du = b(u)|u^{h}|\,du,$$

where $k = (k_1, k_2, ..., k_d)$ and $h = (h_1, h_2, ..., h_d)$ are multi-indices made of nonnegative
integers. At least one of $k_j$ is not equal to zero.



Remark. (1) In this theorem, for $u = (u_1, u_2, ..., u_d) \in \mathbb{R}^d$, notations $u^{2k}$ and $|u^h|$
respectively represent

$$u^{2k} = u_1^{2k_1} u_2^{2k_2} \cdots u_d^{2k_d},$$
$$|u^{h}| = |u_1^{h_1} u_2^{h_2} \cdots u_d^{h_d}|.$$

The singularity $u = 0$ in $u^{2k} = 0$ is said to be normal crossing. Theorem 1 shows that any
singularities can be made normal crossing by using an analytic function $w = g(u)$.

(2) A map $w = g(u)$ is said to be proper if, for an arbitrary compact set $C$, $g^{-1}(C)$ is also
compact.

(3) The proof of Theorem 1 is given in Theorem 6.1 of (Watanabe, 2009) and (Watanabe,
2010b). In order to prove this theorem, we need the Hironaka resolution theorem (Hironaka,
1964; Atiyah, 1970). The function $w = g(u)$ is often referred to as a resolution map.
(4) In this theorem, a quartet $(k, h, a(x,u), b(u))$ depends on a local coordinate in general.
For a given function $K(w)$, there is an algebraic recursive algorithm which enables us
to find a resolution map $w = g(u)$. However, for a fixed $K(w)$, a resolution map is not
unique, so that a quartet $(\mathcal{M}, g(u), a(x,u), b(u))$ is not unique.



Definition. (Real Log Canonical Threshold) Let $\{U_\alpha; \alpha \in \mathcal{A}\}$ be a system of local
coordinates of a manifold $\mathcal{M}$,

$$\mathcal{M} = \bigcup_{\alpha \in \mathcal{A}} U_\alpha.$$

The real log canonical threshold (RLCT) is defined by

$$\lambda = \min_{\alpha \in \mathcal{A}} \min_{j=1}^{d} \left(\frac{h_j + 1}{2k_j}\right), \tag{18}$$

where we define $1/k_j = \infty$ for $k_j = 0$. The multiplicity $m$ is defined by

$$m = \max_{\alpha \in \mathcal{A}} \#\left\{j;\ \frac{h_j + 1}{2k_j} = \lambda\right\}, \tag{19}$$

where $\#S$ shows the number of elements of a set $S$.



This concept RLCT is well known in algebraic geometry and statistical learning theory. 
In the following definition we introduce a parity of a statistical model. 

Definition. (Parity of Statistical Model) The support of $\varphi(g(u))$ is defined by

$$\mathrm{supp}\,\varphi(g(u)) = \overline{\{u \in \mathcal{M};\ g(u) \in W_\epsilon,\ \varphi(g(u)) > 0\}},$$

where $\overline{S}$ shows the closure of a set $S$. A local coordinate $U_\alpha$ is said to be an essential local
coordinate if both equations

$$\lambda = \min_{j=1}^{d} \left(\frac{h_j + 1}{2k_j}\right),$$
$$m = \#\{j;\ (h_j + 1)/(2k_j) = \lambda\},$$

hold in its local coordinate. The set of all essential local coordinates is denoted by
$\{U_\alpha; \alpha \in \mathcal{A}^*\}$. If, for an arbitrary essential local coordinate, there exist $\delta > 0$ and a natural number
$j$ in the set $\{j; (h_j + 1)/(2k_j) = \lambda\}$ such that

(1) $k_j$ is an odd number,

(2) $\{(0, 0, ..., 0, u_j, 0, 0, ..., 0);\ |u_j| < \delta\} \subset \mathrm{supp}\,\varphi(g(u))$,

then we define $Q(K(g(u)), \varphi(g(u))) = 1$. If otherwise, $Q(K(g(u)), \varphi(g(u))) = 0$. If there
exists a resolution map $w = g(u)$ such that $Q(K(g(u)), \varphi(g(u))) = 1$, then we define

$$Q(K(w), \varphi(w)) = 1. \tag{20}$$

If otherwise, $Q(K(w), \varphi(w)) = 0$. If $Q(K(w), \varphi(w)) = 1$, then the parity of a statistical
model is said to be odd, otherwise even.



It was proved in Theorem 2.4 of (Watanabe, 2009) that, for a given set $(q, p, \varphi)$, $\lambda$ and
$m$ are independent of a choice of a resolution map. Such a value is called a birational
invariant. The RLCT is a birational invariant.

Lemma 2. If a true distribution $q(x)$ is realizable by a statistical model $p(x|w)$, then the
value $Q(K(g(u)), \varphi(g(u)))$ is independent of a choice of a resolution map $w = g(u)$.

Proof of this lemma is shown in Section 5. Lemma 2 indicates that, if a true distribution
is realizable by a statistical model, then $Q(K(g(u)), \varphi(g(u)))$ is a birational invariant. The
present paper proposes a conjecture that $Q(K(g(u)), \varphi(g(u)))$ is a birational invariant in
general. By Lemma 2, this conjecture is proved if we can show the proposition that, for
an arbitrary nonnegative analytic function $K(w)$, there exist $q(x)$ and $p(x|w)$ such that
$K(w)$ is the Kullback-Leibler distance from $q(x)$ to $p(x|w)$.
Example. Let $w = (a, b, c) \in \mathbb{R}^3$ and

$$K(w) = (ab + c)^2 + a^2 b^4,$$

which is the Kullback-Leibler distance of a neural network model in Example 1.6 of
(Watanabe, 2009), where a true distribution is realizable by a statistical model. The
prior $\varphi(w)$ is defined by some nonzero function on a sufficiently large compact set. Let a
system of local coordinates be

$$U_i = \{(a_i, b_i, c_i) \in \mathbb{R}^3\} \quad (i = 1, 2, 3, 4).$$

A resolution map $g : U_1 \cup U_2 \cup U_3 \cup U_4 \to \mathbb{R}^3$ in each local coordinate is defined by

$$a = a_1 c_1, \quad b = b_1, \quad c = c_1,$$
$$a = a_2, \quad b = b_2 c_2, \quad c = a_2 (1 - b_2) c_2,$$
$$a = a_3, \quad b = b_3, \quad c = a_3 b_3 (b_3 c_3 - 1),$$
$$a = a_4, \quad b = b_4 c_4, \quad c = a_4 b_4 c_4 (c_4 - 1).$$

Then

$$K(a,b,c) = c_1^2 \{(a_1 b_1 + 1)^2 + a_1^2 b_1^4\} = a_2^2 c_2^2 (1 + b_2^4 c_2^2)$$
$$= a_3^2 b_3^4 (c_3^2 + 1) = a_4^2 b_4^2 c_4^4 (1 + b_4^2).$$

The Jacobian determinant is

$$|g'(u)| = |c_1| = |a_2 c_2| = |a_3 b_3^2| = |a_4 b_4 c_4^2|.$$

Therefore $\lambda = 3/4$ and $m = 1$. The essential local coordinates are $U_3$ and $U_4$. In $U_3$ and
$U_4$, the sets $\{u_j; (h_j + 1)/(2k_j) = 3/4\}$ are respectively $\{b_3\}$ and $\{c_4\}$, where $2k_j = 4$ in
both cases. Consequently, both $k_j$ are even, so $Q(K(w), \varphi(w)) = 0$.
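
The example can be checked mechanically. The sketch below, written with exact rational arithmetic, verifies that each chart map turns K into the stated factorized form and then evaluates eq.(18) and the multiplicity from the multi-indices (k, h). The dictionaries `CHARTS` and `INDICES` and the helper names are hypothetical; the multi-indices are read off by hand from the factorizations and Jacobians above, and the parity helper assumes that the support condition (2) of the parity definition holds.

```python
from fractions import Fraction
import random


def K(a, b, c):
    return (a * b + c) ** 2 + a ** 2 * b ** 4


# chart -> (resolution map g, claimed factorized form of K(g(u)))
CHARTS = {
    1: (lambda a, b, c: (a * c, b, c),
        lambda a, b, c: c ** 2 * ((a * b + 1) ** 2 + a ** 2 * b ** 4)),
    2: (lambda a, b, c: (a, b * c, a * (1 - b) * c),
        lambda a, b, c: a ** 2 * c ** 2 * (1 + b ** 4 * c ** 2)),
    3: (lambda a, b, c: (a, b, a * b * (b * c - 1)),
        lambda a, b, c: a ** 2 * b ** 4 * (c ** 2 + 1)),
    4: (lambda a, b, c: (a, b * c, a * b * c * (c - 1)),
        lambda a, b, c: a ** 2 * b ** 2 * c ** 4 * (1 + b ** 2)),
}

# Multi-indices (k, h) per chart, ordered as (a_i, b_i, c_i):
# K(g(u)) = u^{2k} * (positive factor) and |g'(u)| = |u^h|.
INDICES = {
    1: ((0, 0, 1), (0, 0, 1)),
    2: ((1, 0, 1), (1, 0, 1)),
    3: ((1, 2, 0), (1, 2, 0)),
    4: ((1, 1, 2), (1, 1, 2)),
}


def factorization_holds(trials=200, seed=0):
    """Compare K(g(u)) with the factorized form at random rational points."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = [Fraction(rng.randint(-9, 9), rng.randint(1, 9)) for _ in range(3)]
        for g, factored in CHARTS.values():
            if K(*g(*u)) != factored(*u):
                return False
    return True


def rlct(indices):
    """Return (lambda, multiplicity, essential charts); k_j = 0 is skipped
    because 1/k_j is defined as infinity."""
    ratios = {
        chart: [Fraction(hj + 1, 2 * kj) for kj, hj in zip(k, h) if kj > 0]
        for chart, (k, h) in indices.items()
    }
    lam = min(min(r) for r in ratios.values())
    counts = {chart: r.count(lam) for chart, r in ratios.items()}
    m = max(counts.values())
    essential = sorted(ch for ch, cnt in counts.items() if cnt == m)
    return lam, m, essential


def parity(indices, lam, essential):
    """Q for this resolution map: 1 only if every essential chart has an odd k_j
    among the indices attaining lambda (support condition assumed to hold)."""
    for ch in essential:
        k, h = indices[ch]
        if not any(kj > 0 and Fraction(hj + 1, 2 * kj) == lam and kj % 2 == 1
                   for kj, hj in zip(k, h)):
            return 0
    return 1


lam, m, essential = rlct(INDICES)
print(factorization_holds(), lam, m, essential, parity(INDICES, lam, essential))
```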



Lemma 3. Assume that the Fundamental Conditions (1)-(4) are satisfied and that a true
distribution $q(x)$ is regular for a statistical model $p(x|w)$. If $w_0$ is contained in the open
kernel of $W$ and if $\varphi(w_0) > 0$, then

$$\lambda = \frac{d}{2}, \quad m = 1,$$

and

$$Q(K(w), \varphi(w)) = 1.$$

Proof of this lemma is shown in Section 5.

Theorem 2. Assume that the Fundamental Conditions (1)-(4) are satisfied. Then the
following holds.

$$\mathcal{F} = nL_n(w_0) + \lambda \log n - (m - 1) \log\log n + R_n,$$

where $\lambda$ is a real log canonical threshold, $m$ is its multiplicity, and $R_n$ is a random variable
which converges to a random variable in law, when $n \to \infty$.
Theorem 2 was proved in the previous papers. In the case when $q(x)$ is realizable by
and singular for $p(x|w)$, the expectation value of $\mathcal{F}$ is given in (Watanabe, 2001a) and
(Watanabe, 2009). The results were generalized in (Watanabe, 2010b) for the case that
$q(x)$ is unrealizable.

Remark. In practical applications, we do not know the true distribution, hence $\lambda$ and $m$
are unknown. Therefore, we cannot directly apply Theorem 2 to such cases. The main
purpose of the present paper is to construct a new method to estimate $\mathcal{F}$ even if the true
distribution is unknown.



4 Main Results 

In this section, we introduce the main results of the present paper. 

Theorem 3. (Unique Existence of the Optimal Parameter) Assume that $L_n(w)$ is not a
constant function of $w$. Then the following hold.

(1) The value $\mathbb{E}_w^{\beta}[nL_n(w)]$ is a decreasing function of $\beta$.

(2) There exists a unique $\beta^*$ ($0 < \beta^* < 1$) which satisfies

$$\mathcal{F} = \mathbb{E}_w^{\beta^*}[nL_n(w)]. \tag{21}$$

The proof of Theorem 3 is given in Section 5. Based on this theorem, we define the
optimal inverse temperature.

Definition. The unique parameter $\beta^*$ that satisfies eq.(21) is called the optimal inverse
temperature.

In general, the optimal inverse temperature $\beta^*$ depends on a true distribution $q(x)$, a
statistical model $p(x|w)$, a prior $\varphi(w)$, and training samples. Therefore $\beta^*$ is a random
variable. In the present paper, we study its probabilistic behavior. Theorem 4 is a
mathematical base for such a purpose.

Theorem 4. (Main Theorem) Assume Fundamental Conditions (1)-(4) and that

$$\beta = \frac{\beta_0}{\log n},$$

where $\beta_0$ is a constant. Then there exists a random variable $U_n$ such that

$$\mathbb{E}_w^{\beta}[nL_n(w)] = nL_n(w_0) + \frac{\lambda \log n}{\beta_0} + U_n \sqrt{\frac{\lambda \log n}{2\beta_0}} + O_p(1),$$

where $\lambda$ is the real log canonical threshold and $U_n$ is a random variable which satisfies
$\mathbb{E}[U_n] = 0$ and converges to a gaussian random variable in law as $n \to \infty$. Moreover, if a true
distribution $q(x)$ is realizable by a statistical model $p(x|w)$, then $\mathbb{E}[(U_n)^2] < 1$.

The proof of Theorem 4 is given in Section 5. Theorem 4 with $\beta_0 = 1$ shows that

$$\mathrm{WBIC} = nL_n(w_0) + \lambda \log n + U_n \sqrt{\frac{\lambda \log n}{2}} + O_p(1),$$

whose first two main terms are equal to those of $\mathcal{F}$ in Theorem 2. From Theorem 4 and
its proof, three important corollaries are derived.
Corollary 1. If the parity of a statistical model is odd, $Q(K(w), \varphi(w)) = 1$, then $U_n = 0$.

Corollary 2. Let $\beta^*$ be the optimal inverse temperature. Then

$$\beta^* = \frac{1}{\log n}\left(1 + \frac{U_n}{\sqrt{2\lambda \log n}} + o_p\left(\frac{1}{\sqrt{\log n}}\right)\right).$$

Corollary 3. Let $\beta_1 = \beta_{01}/\log n$ and $\beta_2 = \beta_{02}/\log n$, where $\beta_{01}$ and $\beta_{02}$ are positive
constants. Then the convergence in probability

$$\frac{\mathbb{E}_w^{\beta_1}[nL_n(w)] - \mathbb{E}_w^{\beta_2}[nL_n(w)]}{1/\beta_1 - 1/\beta_2} \to \lambda \tag{22}$$

holds as $n \to \infty$, where $\lambda$ is the real log canonical threshold.

Proofs of these corollaries are given in Section 5.

The well-known Schwarz BIC is defined by

$$\mathrm{BIC} = nL_n(\hat{w}) + \frac{d}{2}\log n,$$

where $\hat{w}$ is the maximum likelihood estimator. WBIC can be understood as the generalized
BIC for singular statistical models, because it satisfies the following theorem.

Theorem 5. If a true distribution $q(x)$ is regular for a statistical model $p(x|w)$, then

$$\mathrm{WBIC} = nL_n(\hat{w}) + \frac{d}{2}\log n + o_p(1).$$

Proof of Theorem 5 is given in Section 5. This theorem shows that the difference of
WBIC and BIC is smaller than a constant order term, if a true distribution is regular for
a statistical model. This theorem holds even if a true distribution $q(x)$ is unrealizable by
$p(x|w)$.
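
Theorem 5 and the two-temperature estimator of Corollary 3 can both be illustrated in a regular toy case, where the expected value of the estimator is d/2. The sketch below assumes p(x|w) = N(x; w, 1) with d = 1, a uniform prior on [-5, 5], fixed illustrative data, and a plain Riemann sum in place of posterior sampling; none of these choices come from the paper, and the function names are hypothetical.

```python
import math


def n_log_loss(xs, w):
    """n * L_n(w) for the toy model p(x|w) = N(x; w, 1)."""
    n = len(xs)
    return 0.5 * sum((x - w) ** 2 for x in xs) + 0.5 * n * math.log(2.0 * math.pi)


def tempered_mean(xs, beta, lo=-5.0, hi=5.0, grid=4001):
    """E_w^beta[n L_n(w)] under a uniform prior on [lo, hi], by a Riemann sum."""
    step = (hi - lo) / (grid - 1)
    losses = [n_log_loss(xs, lo + i * step) for i in range(grid)]
    base = min(losses)  # subtract the minimum for numerical stability
    weights = [math.exp(-beta * (v - base)) for v in losses]
    return sum(v * wt for v, wt in zip(losses, weights)) / sum(weights)


xs = [(i - 24.5) / 25.0 for i in range(50)]  # fixed illustrative data
n = len(xs)
w_hat = sum(xs) / n  # maximum likelihood estimator for this model

wbic = tempered_mean(xs, 1.0 / math.log(n))
bic = n_log_loss(xs, w_hat) + 0.5 * math.log(n)

# Corollary 3 with beta_01 = 1 and beta_02 = 1.5 (arbitrary positive constants).
beta1, beta2 = 1.0 / math.log(n), 1.5 / math.log(n)
lam_hat = (tempered_mean(xs, beta1) - tempered_mean(xs, beta2)) / (1.0 / beta1 - 1.0 / beta2)
print(wbic, bic, lam_hat)
```

In this Gaussian case the tempered posterior is itself Gaussian, so the agreement is essentially exact; for a singular model the same two-temperature difference would be computed from two posterior simulations.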

Remark. Since the set of parameters $W$ is assumed to be compact, it is proved in Main
Theorem 6.4 of (Watanabe, 2009) that $nL_n(w_0) - nL_n(\hat{w})$ is a constant order random
variable in general. If a true distribution is regular for and realizable by a statistical
model, its average is asymptotically equal to $d/2$, where $d$ is the dimension of the parameter.
However, if a true distribution is singular for a statistical model, then it is much larger than
$d/2$, because it is asymptotically equal to the maximum value of the Gaussian process.
Hence replacement of $nL_n(w_0)$ by $nL_n(\hat{w})$ is not appropriate in singular model evaluation.

5 Proofs of Main Results 

In this section, we prove the main theorems and corollaries. 

5.1 Proof of Lemma 1

Let us define an empirical process,

$$\eta_n(w) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (K(w) - f(X_i,w)).$$

It was proved in Theorem 5.9 and 5.10 of (Watanabe, 2009) that $\eta_n(w)$ converges to a
random process in law and

$$\|\eta_n\| = \sup_{w \in W} |\eta_n(w)|$$

also converges to a random variable in law. If $K(w) \ge 1/n^r$, then

$$nK_n(w) = nK(w) - \sqrt{n}\,\eta_n(w) \ge n^{1-r} - \sqrt{n}\,\|\eta_n\|.$$

By the condition $1 - r > 1/2$ and $\beta = \beta_0/\log n$,

$$\exp(\sqrt{n}) \int_{K(w) \ge 1/n^r} \exp(-n\beta K_n(w))\,\varphi(w)\,dw \le \exp(-n^{1-r}\beta + \sqrt{n} + \sqrt{n}\,\beta\,\|\eta_n\|),$$

which converges to zero in probability, which shows eq.(16). Then, let us prove eq.(17).
Since the set of parameters $W$ is compact, $\|K\| = \sup_w K(w) < \infty$. Therefore,

$$|nK_n(w)| \le n\|K\| + \sqrt{n}\,\|\eta_n\| = n\,(\|K\| + \|\eta_n\|/\sqrt{n}).$$

Hence

$$\exp(\sqrt{n}) \int_{K(w) \ge 1/n^r} |nK_n(w)| \exp(-n\beta K_n(w))\,\varphi(w)\,dw$$
$$\le (\|K\| + \|\eta_n\|/\sqrt{n}) \exp(-n^{1-r}\beta + \sqrt{n} + \sqrt{n}\,\beta\,\|\eta_n\| + \log n),$$

which converges to zero in probability. (Q.E.D.)

5.2 Proof of Lemma 3

Without loss of generality, we can assume $w_0 = 0$. Since $q(x)$ is regular for $p(x|w)$, there
exists $w^*$ such that

$$K(w) = \frac{1}{2}\, w \cdot J(w^*) w,$$

where $J(w)$ is given in eq.(14). Since $J(w_0)$ is a strictly positive definite matrix, there
exists $\epsilon > 0$ such that, if $K(w) \le \epsilon$, then $J(w^*)$ is positive definite. Let $\ell_1$ and $\ell_2$ be
respectively the minimum and maximum eigenvalues of $\{J(w^*); K(w) \le \epsilon\}$. Then

$$\frac{\ell_1}{2} \sum_{j=1}^{d} w_j^2 \le \frac{1}{2}\, w \cdot J(w^*) w \le \ell_2 \sum_{j=1}^{d} w_j^2.$$

By using a blow-up $g : U_1 \cup \cdots \cup U_d \to W$ which is represented on each local coordinate
$U_i = (u_{i1}, u_{i2}, ..., u_{id})$,

$$w_i = u_{ii},$$
$$w_j = u_{ii} u_{ij} \quad (j \ne i),$$

it follows that

$$\frac{\ell_1 u_{ii}^2}{2} \sum_{j=1}^{d} \hat{u}_{ij}^2 \le K(g(u)) \le \ell_2 u_{ii}^2 \sum_{j=1}^{d} \hat{u}_{ij}^2,$$

where $\hat{u}_{ij} = u_{ij}$ ($j \ne i$) and $\hat{u}_{ii} = 1$. These inequalities show that $k_i = 1$ in $U_i$, therefore
$Q(K(w), \varphi(w)) = 1$. The Jacobian determinant of the blow-up is

$$|g'(u)| = |u_{ii}|^{d-1},$$

hence $\lambda = d/2$ and $m = 1$. (Q.E.D.)

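5.3 Proof of Theorem 3

Let us define a function $F_n(\beta)$ of $\beta > 0$ by

$$F_n(\beta) = -\log \int \prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw.$$

Then, by the definition, $\mathcal{F} = F_n(1)$ and

$$F_n'(\beta) = \mathbb{E}_w^{\beta}[nL_n(w)],$$
$$F_n''(\beta) = -\mathbb{E}_w^{\beta}[(nL_n(w))^2] + \mathbb{E}_w^{\beta}[nL_n(w)]^2.$$

By the Cauchy-Schwarz inequality and the assumption that $L_n(w)$ is not a constant
function,

$$F_n''(\beta) < 0,$$

which shows (1). Since $F_n(0) = 0$,

$$\mathcal{F} = F_n(1) = \int_0^1 F_n'(\beta)\,d\beta.$$

By using the mean value theorem, there exists $\beta^*$ ($0 < \beta^* < 1$) such that

$$\mathcal{F} = F_n'(\beta^*) = \mathbb{E}_w^{\beta^*}[nL_n(w)].$$

Since $F_n'(\beta)$ is a decreasing function, $\beta^*$ is unique, which completes Theorem 3. (Q.E.D.)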
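
Because the tempered average is strictly decreasing in the inverse temperature, the optimal value can be located by bisection on (0, 1). The sketch below illustrates this for a toy setting that is not taken from the paper: p(x|w) = N(x; w, 1), a uniform prior on [-5, 5], fixed illustrative data, and Riemann sums for all integrals. All function names are hypothetical.

```python
import math


def grid_losses(xs, lo=-5.0, hi=5.0, grid=4001):
    """Values of n*L_n(w) on a uniform grid over [lo, hi], plus the grid step."""
    n = len(xs)
    const = 0.5 * n * math.log(2.0 * math.pi)
    step = (hi - lo) / (grid - 1)
    ws = [lo + i * step for i in range(grid)]
    return [0.5 * sum((x - w) ** 2 for x in xs) + const for w in ws], step


def tempered_mean(losses, beta):
    """E_w^beta[n L_n(w)] for a uniform prior (the prior constant cancels)."""
    base = min(losses)
    weights = [math.exp(-beta * (v - base)) for v in losses]
    return sum(v * wt for v, wt in zip(losses, weights)) / sum(weights)


def free_energy(losses, step, prior_density):
    """F = -log of the integral of exp(-n L_n(w)) * prior, by a Riemann sum."""
    base = min(losses)
    integral = sum(math.exp(-(v - base)) for v in losses) * step * prior_density
    return base - math.log(integral)


def optimal_beta(losses, target, lo=1e-6, hi=1.0, iters=60):
    """Bisection for the beta at which the tempered mean equals the free energy."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tempered_mean(losses, mid) > target:
            lo = mid  # mean still above F: a larger beta is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


xs = [(i - 24.5) / 25.0 for i in range(50)]  # fixed illustrative data
n = len(xs)
losses, step = grid_losses(xs)
F = free_energy(losses, step, prior_density=0.1)  # uniform prior on [-5, 5]
beta_star = optimal_beta(losses, F)
print(beta_star, beta_star * math.log(n))
```

For this small sample the product of the optimal inverse temperature and log n is still noticeably below one, which is consistent with the slow convergence rate stated in Corollary 2.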

5.4 First Preparation for Proof of Theorem 4

In this subsection, we prepare the proof of Theorem 4. By using eq.(12) and eq.(13), the
proof of Theorem 4 results in evaluating $\mathbb{E}_w^{\beta}[nK_n(w)]$. By Lemma 1,

$$\mathbb{E}_w^{\beta}[nK_n(w)] = \frac{B_n + o_p(\exp(-\sqrt{n}))}{A_n + o_p(\exp(-\sqrt{n}))}, \tag{23}$$

where $A_n$ and $B_n$ are respectively defined by

$$A_n = \int_{K(w) \le \epsilon} \exp(-n\beta K_n(w))\,\varphi(w)\,dw, \tag{24}$$
$$B_n = \int_{K(w) \le \epsilon} nK_n(w) \exp(-n\beta K_n(w))\,\varphi(w)\,dw. \tag{25}$$
By Theorem 1, an integral over $\{w \in W; K(w) \le \epsilon\}$ can be calculated by that over $\mathcal{M}$. For
a given set of local coordinates $\{U_\alpha\}$ of $\mathcal{M}$, there exists a set of $C^{\infty}$ class functions $\{\varphi_\alpha(g(u))\}$
such that, for an arbitrary $u \in \mathcal{M}$,

$$\sum_{\alpha \in \mathcal{A}} \varphi_\alpha(g(u)) = \varphi(g(u)).$$

By using this fact, for an arbitrary integrable function $G(w)$,

$$\int_{K(w) \le \epsilon} G(w)\varphi(w)\,dw = \sum_{\alpha \in \mathcal{A}} \int_{U_\alpha} G(g(u))\,\varphi_\alpha(g(u))\,|g'(u)|\,du.$$

Without loss of generality, we can assume that $U_\alpha \cap \mathrm{supp}\,\varphi(g(u))$ is isomorphic to $[-1,1]^d$
and that $\varphi_\alpha(g(u)) > 0$ in $[-1,1]^d$. Moreover, by Theorem 1, there exists a function
$b_\alpha(u) > 0$ such that

$$\varphi_\alpha(g(u))\,|g'(u)| = |u^h|\,b_\alpha(u),$$

in each local coordinate. Consequently,

$$\int_{K(w) \le \epsilon} G(w)\varphi(w)\,dw = \sum_{\alpha \in \mathcal{A}} \int_{[-1,1]^d} du\ G(g(u))\,|u^h|\,b_\alpha(u).$$

In each local coordinate,

$$K(g(u)) = u^{2k}.$$

We define a function $\xi_n(u)$ by

$$\xi_n(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \{u^k - a(X_i,u)\}.$$

Then

$$K_n(g(u)) = u^{2k} - \frac{1}{\sqrt{n}}\, u^k\, \xi_n(u).$$

Note that

$$u^k = \int a(x,u)\,q(x)\,dx$$

holds, because

$$u^{2k} = \int f(x,g(u))\,q(x)\,dx = u^k \int a(x,u)\,q(x)\,dx.$$

Therefore, for an arbitrary $u$,

$$\mathbb{E}[\xi_n(u)] = 0.$$

The function $\xi_n(u)$ can be understood as a random process on $\mathcal{M}$. Under Fundamental
Conditions (1)-(4), it is proved in Theorem 6.1, Theorem 6.2, and Theorem 6.3 of (Watanabe,
2009) that

(1) $\xi_n(u)$ converges to a gaussian random process $\xi(u)$ in law and

$$\mathbb{E}[\sup_u \xi_n(u)^2] \to \mathbb{E}[\sup_u \xi(u)^2].$$

(2) If $q(x)$ is realizable by $p(x|w)$, and if $u^{2k} = 0$, then

$$\mathbb{E}[\xi_n(u)^2] = \mathbb{E}_X[a(X,u)^2] = 2. \tag{26}$$

By using the random process $\xi_n(u)$, two random variables $A_n$ and $B_n$ can be represented
by integrals over $\mathcal{M}$,

$$A_n = \sum_{\alpha \in \mathcal{A}} \int_{[-1,1]^d} du\ \exp(-n\beta u^{2k} + \sqrt{n}\,\beta\,u^k \xi_n(u))\,|u^h|\,b_\alpha(u), \tag{27}$$
$$B_n = \sum_{\alpha \in \mathcal{A}} \int_{[-1,1]^d} du\ (n u^{2k} - \sqrt{n}\,u^k \xi_n(u)) \exp(-n\beta u^{2k} + \sqrt{n}\,\beta\,u^k \xi_n(u))\,|u^h|\,b_\alpha(u). \tag{28}$$

To prove Theorem 4, we study asymptotics of these two values.

5.5 Second Preparation for Proof of Theorem 4

To evaluate two integrals $A_n$ and $B_n$ as $n \to \infty$, we have to study the asymptotic behavior
of the following Schwartz distribution,

$$\delta(t - u^{2k})\,|u|^h$$

for $t \to 0$. Without loss of generality, we can assume that, in each essential local coordinate,

$$\lambda = \frac{h_1 + 1}{2k_1} = \frac{h_2 + 1}{2k_2} = \cdots = \frac{h_m + 1}{2k_m} < \frac{h_j + 1}{2k_j},$$

where $m < j \le d$. A variable $u \in \mathbb{R}^d$ is denoted by

$$u = (u_a, u_b) \in \mathbb{R}^m \times \mathbb{R}^{d-m}.$$

We define a measure $du^*$ by

$$du^* = \frac{\big(\prod_{j=1}^{m} \delta(u_j)\big)\big(\prod_{j=m+1}^{d} (u_j)^{\mu_j}\big)\,du}{2^m (m-1)!\,\big(\prod_{j=1}^{m} k_j\big)}, \tag{29}$$

where $\delta(\ )$ is the Dirac delta function, and $\mu = (\mu_{m+1}, \mu_{m+2}, ..., \mu_d)$ is a multi-index defined
by

$$\mu_j = -2\lambda k_j + h_j \quad (m + 1 \le j \le d).$$

Then $\mu_j > -1$, hence eq.(29) defines a measure on $\mathcal{M}$. The support of $du^*$ is
$\{u = (u_a, u_b);\ u_a = 0\}$.

Definition. Let $\sigma$ be a $d$-dimensional variable,

$$\sigma = (\sigma_1, \sigma_2, ..., \sigma_d) \in \mathbb{R}^d$$

where $\sigma_j = \pm 1$. The set of all such variables is denoted by $S(d)$. We use a notation

$$\sigma u = (\sigma_1 u_1, \sigma_2 u_2, ..., \sigma_d u_d) \in \mathbb{R}^d.$$

Then $(\sigma u)^k = \sigma^k u^k$ and $(\sigma u)^{2k} = u^{2k}$. By using this notation, we can derive the
asymptotic behavior of $\delta(t - u^{2k})|u^h|$ for $t \to 0$.
Lemma 4. Let $G(u^{2k}, u^k, u)$ be a real-valued $C^1$-class function of $(u^{2k}, u^k, u)$ ($u \in \mathbb{R}^d$).
The following asymptotic expansion holds as $t \to +0$,

$$\int_{[-1,1]^d} du\ \delta(t - u^{2k})\,|u|^h\,G(u^{2k}, u^k, u)$$
$$= t^{\lambda - 1} (-\log t)^{m-1} \sum_{\sigma \in S(d)} \int_{[0,1]^d} du^*\ G(t, \sigma^k \sqrt{t}, \sigma u) + O(t^{\lambda - 1} (-\log t)^{m-2}), \tag{30}$$

where $du^*$ is a measure defined by eq.(29).

(Proof of Lemma 4) Let $Y(t)$ be the left hand side of eq.(30).

$$Y(t) = \sum_{\sigma \in S(d)} \int_{[0,1]^d} \delta(t - (\sigma u)^{2k})\,|\sigma u|^h\,G((\sigma u)^{2k}, (\sigma u)^k, \sigma u)\,du$$
$$= \sum_{\sigma \in S(d)} \int_{[0,1]^d} \delta(t - u^{2k})\,|u|^h\,G(t, \sigma^k \sqrt{t}, \sigma u)\,du.$$

By using Theorem 4.9 of (Watanabe, 2009), if $u \in [0,1]^d$, then

$$\delta(t - u^{2k})\,|u|^h\,du = t^{\lambda - 1} (-\log t)^{m-1}\,du^* + O(t^{\lambda - 1} (-\log t)^{m-2}).$$

By applying this relation to $Y(t)$, we obtain Lemma 4. (Q.E.D.)
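
As a sanity check of the expansion, consider the simplest one-dimensional case $d = 1$, $k = 1$, $h = 0$ with $G \equiv 1$. This is an illustrative choice, not an example from the paper. The computation assumes that the normalizing constant of $du^*$ is $2^m (m-1)! \prod_{j \le m} k_j$ and that the delta function located at the endpoint $u = 0$ integrates to one over $[0,1]$.

```latex
% Left hand side: the two roots u = \pm\sqrt{t} each contribute 1/(2\sqrt{t}).
\int_{-1}^{1} \delta(t - u^{2})\,du = 2 \cdot \frac{1}{2\sqrt{t}} = t^{-1/2}
\qquad (0 < t < 1).

% Right hand side: \lambda = (h+1)/(2k) = 1/2, \; m = 1, \;
% du^{*} = \delta(u)\,du / (2^{1} \cdot 0! \cdot 1) = \tfrac{1}{2}\,\delta(u)\,du.
t^{\lambda - 1} (-\log t)^{m-1} \sum_{\sigma \in S(1)} \int_{[0,1]} du^{*}
  = t^{-1/2} \cdot 2 \cdot \tfrac{1}{2} = t^{-1/2}.
```

Both sides agree exactly in this case, so the remainder term vanishes.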
5.6 Proof of Lemma 2

Let $\Phi(w) > 0$ be an arbitrary $C^{\infty}$ class function on $W_\epsilon$. Let $Y(t, \Phi)$ ($t > 0$) be a function
defined by

$$Y(t, \Phi) = \int_{K(w) \le \epsilon} \delta(t - K(w))\,f(x,w)\,\Phi(w)\,\varphi(w)\,dw,$$

whose value is independent of a choice of a resolution map. By using a resolution map
$w = g(u)$,

$$Y(t, \Phi) = \sum_{\alpha \in \mathcal{A}} \int_{[-1,1]^d} du\ \delta(t - u^{2k})\,u^k\,|u|^h\,a(x,u)\,\Phi(g(u))\,b_\alpha(u).$$

By Lemma 4 and $\sigma = (\sigma_a, \sigma_b)$,

$$Y(t, \Phi) = t^{\lambda - 1/2} (-\log t)^{m-1} \sum_{\alpha \in \mathcal{A}^*} \sum_{\sigma_a \in S(m)} (\sigma_a)^k \sum_{\sigma_b \in S(d-m)} (\sigma_b)^k$$
$$\times \int_{[0,1]^d} du^*\ a(x, \sigma u)\,\Phi(g(\sigma u))\,b_\alpha(\sigma u) + O(t^{\lambda - 1/2} (-\log t)^{m-2}).$$

By the assumption that a true distribution is realizable by a statistical model, eq.(26)
shows that there exists $x$ such that $a(x,u) \ne 0$ for $u^{2k} = 0$. On the support of $du^*$,

$$\sigma u = (\sigma_a u_a, \sigma_b u_b) = (0, \sigma_b u_b),$$

consequently the main order term of $Y(t, \Phi)$ is determined by $\Phi(0, u_b)$. If
$Q(K(g(u)), \varphi(g(u))) = 1$, then at least one $k_j$ ($1 \le j \le m$) is odd, $\sigma^k$ takes values $\pm 1$, hence

$$\sum_{\sigma_a \in S(m)} (\sigma_a)^k = 0,$$

which shows that the coefficient of the main order term in $Y(t, \Phi)$ ($t \to +0$) is zero for an
arbitrary $\Phi(w)$. If $Q(K(g(u)), \varphi(g(u))) = 0$,

$$\sum_{\sigma_a \in S(m)} (\sigma_a)^k = \sum_{\sigma_a \in S(m)} 1 \ne 0.$$

There exists a function $\Phi(w)$ such that the main order term is not equal to zero.
Therefore $Q(K(g(u)), \varphi(g(u)))$ does not depend on the resolution map. (Q.E.D.)



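5.7 Proof of Theorem 4

In this subsection, we prove Theorem 4 using the foregoing preparations. We need to
study $A_n$ and $B_n$ in eq. (24) and eq. (25). Firstly, we study $A_n$.

$$A_n = \sum_{\alpha}\int_{[-1,1]^d} du\,\exp\!\left(-n\beta u^{2k} + \beta\sqrt{n}\,u^k\xi_n(u)\right)|u|^h\,b_\alpha(u)
= \sum_{\alpha}\int_{[-1,1]^d} du\int_0^\infty dt\;\delta(t-u^{2k})\,|u|^h\,b_\alpha(u)
\times\exp\!\left(-n\beta u^{2k} + \beta\sqrt{n}\,u^k\xi_n(u)\right).$$

By the substitution $t := t/(n\beta)$ and $dt := dt/(n\beta)$,

$$A_n = \sum_{\alpha}\int_{[-1,1]^d} b_\alpha(u)\,du\int_0^\infty\frac{dt}{n\beta}\;\delta\!\left(\frac{t}{n\beta}-u^{2k}\right)|u|^h
\times\exp\!\left(-n\beta u^{2k} + \beta\sqrt{n}\,u^k\xi_n(u)\right).$$

For simple notations, we use

$$\int_{\mathcal{M}} du^* = \sum_{\alpha\in\mathcal{A}^*}\sum_{\sigma\in S(d)}\int_{[0,1]^d} b_\alpha(u)\,du^*, \qquad \xi_n^*(u) = \sigma^k\,\xi_n(u),$$

where $\{\mathcal{M}_\alpha\,;\ \alpha\in\mathcal{A}^*\}$ is the set of all essential local coordinates. Then by using
Lemma 4, $\delta(t/(n\beta) - u^{2k})$ can be asymptotically expanded for $n\beta \to \infty$, hence

$$A_n = \int_{\mathcal{M}} du^*\int_0^\infty\frac{dt}{n\beta}\left(\frac{t}{n\beta}\right)^{\lambda-1}\left(-\log\frac{t}{n\beta}\right)^{m-1}
\exp\!\left(-t + \sqrt{\beta t}\,\xi_n^*(u)\right) + O_p\!\left(\frac{(\log(n\beta))^{m-2}}{(n\beta)^\lambda}\right)$$

$$\phantom{A_n} = \frac{(\log(n\beta))^{m-1}}{(n\beta)^\lambda}\int_{\mathcal{M}} du^*\int_0^\infty dt\;t^{\lambda-1}\exp(-t)\exp\!\left(\sqrt{\beta t}\,\xi_n^*(u)\right)
+ O_p\!\left(\frac{(\log(n\beta))^{m-2}}{(n\beta)^\lambda}\right).$$

Since $\beta = \beta_0/\log n \to 0$,

$$\exp\!\left(\sqrt{\beta t}\,\xi_n^*(u)\right) = 1 + \sqrt{\beta t}\,\xi_n^*(u) + O_p(\beta).$$

By using the gamma function,

$$\Gamma(\lambda) = \int_0^\infty t^{\lambda-1}\exp(-t)\,dt,$$

it follows that

$$A_n = \frac{(\log(n\beta))^{m-1}}{(n\beta)^\lambda}\left\{\Gamma(\lambda)\left(\int_{\mathcal{M}} du^*\right) + \sqrt{\beta}\,\Gamma(\lambda+\tfrac{1}{2})\left(\int_{\mathcal{M}} du^*\,\xi_n^*(u)\right)\right\}
+ O_p\!\left(\frac{(\log(n\beta))^{m-2}}{(n\beta)^\lambda}\right).$$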


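Secondly, $B_n$ can be calculated in the same way,

$$B_n = \sum_{\alpha}\int_{[-1,1]^d} du\int_0^\infty dt\;\delta(t-u^{2k})\,|u|^h\,b_\alpha(u)
\times\left(n u^{2k} - \sqrt{n}\,u^k\xi_n(u)\right)\exp\!\left(-n\beta u^{2k} + \beta\sqrt{n}\,u^k\xi_n(u)\right).$$

By the substitution $t := t/(n\beta)$ and $dt := dt/(n\beta)$ and Lemma 4,

$$B_n = \int_{\mathcal{M}} du^*\int_0^\infty\frac{dt}{n\beta}\left(\frac{t}{n\beta}\right)^{\lambda-1}\left(-\log\frac{t}{n\beta}\right)^{m-1}
\times\frac{1}{\beta}\left(t - \sqrt{\beta t}\,\xi_n^*(u)\right)\exp\!\left(-t + \sqrt{\beta t}\,\xi_n^*(u)\right)
+ O_p\!\left(\frac{(\log(n\beta))^{m-2}}{\beta(n\beta)^\lambda}\right)$$

$$\phantom{B_n} = \frac{(\log(n\beta))^{m-1}}{\beta(n\beta)^\lambda}\int_{\mathcal{M}} du^*\int_0^\infty dt\;t^{\lambda-1}\left(t - \sqrt{\beta t}\,\xi_n^*(u)\right)\exp(-t)\exp\!\left(\sqrt{\beta t}\,\xi_n^*(u)\right)
+ O_p\!\left(\frac{(\log(n\beta))^{m-2}}{\beta(n\beta)^\lambda}\right).$$

Therefore,

$$B_n = \frac{(\log(n\beta))^{m-1}}{\beta(n\beta)^\lambda}\left\{\Gamma(\lambda+1)\left(\int_{\mathcal{M}} du^*\right) + \sqrt{\beta}\,\Gamma(\lambda+\tfrac{3}{2})\left(\int_{\mathcal{M}} du^*\,\xi_n^*(u)\right)
- \sqrt{\beta}\,\Gamma(\lambda+\tfrac{1}{2})\left(\int_{\mathcal{M}} du^*\,\xi_n^*(u)\right)\right\}
+ O_p\!\left(\frac{(\log(n\beta))^{m-2}}{\beta(n\beta)^\lambda}\right).$$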



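Let

$$\Theta = \frac{\int_{\mathcal{M}} du^*\,\xi_n^*(u)}{\int_{\mathcal{M}} du^*}.$$

By applying the results for $A_n$ and $B_n$ to eq. (25),

$$E_w^\beta[nK_n(w)] = \frac{1}{\beta}\times\frac{\Gamma(\lambda+1) + \sqrt{\beta}\,\Theta\,\{\Gamma(\lambda+3/2) - \Gamma(\lambda+1/2)\}}{\Gamma(\lambda) + \sqrt{\beta}\,\Theta\,\Gamma(\lambda+1/2)} + O_p(1).$$

Note that, if $a$, $b$, $c$, $d$ are constants and $\beta \to 0$,

$$\frac{c + \sqrt{\beta}\,d}{a + \sqrt{\beta}\,b} = \frac{c}{a} + \sqrt{\beta}\left(\frac{ad - bc}{a^2}\right) + O(\beta).$$

Then by using the identity

$$\frac{\Gamma(\lambda)\left(\Gamma(\lambda+3/2) - \Gamma(\lambda+1/2)\right) - \Gamma(\lambda+1)\,\Gamma(\lambda+1/2)}{\Gamma(\lambda)^2} = -\frac{\Gamma(\lambda+1/2)}{2\,\Gamma(\lambda)},$$

we obtain

$$E_w^\beta[nK_n(w)] = \frac{1}{\beta}\,\frac{\Gamma(\lambda+1)}{\Gamma(\lambda)} - \frac{\Theta}{\sqrt{\beta}}\,\frac{\Gamma(\lambda+1/2)}{2\,\Gamma(\lambda)} + O_p(1).$$

A random variable $U_n$ is defined by

$$U_n = -\frac{\Theta\,\Gamma(\lambda+1/2)}{\sqrt{2\lambda}\,\Gamma(\lambda)}.$$

Then it follows that

$$E_w^\beta[nK_n(w)] = \frac{\lambda}{\beta} + U_n\sqrt{\frac{\lambda}{2\beta}} + O_p(1).$$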
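
The gamma-function identity and the expansion in $\sqrt{\beta}$ used in this step can be checked
numerically. The following Python sketch is only an illustration under stated assumptions: the
function names are hypothetical and not part of the paper, and $\Theta$ is treated as a fixed
number rather than a random variable.

```python
import math


def identity_lhs(lam):
    """Left-hand side of the gamma identity used in the proof of Theorem 4."""
    g = math.gamma
    numerator = g(lam) * (g(lam + 1.5) - g(lam + 0.5)) - g(lam + 1.0) * g(lam + 0.5)
    return numerator / g(lam) ** 2


def identity_rhs(lam):
    """Right-hand side: -Gamma(lam + 1/2) / (2 Gamma(lam))."""
    return -math.gamma(lam + 0.5) / (2.0 * math.gamma(lam))


def ratio(beta, lam, theta):
    """Exact ratio (1/beta) * (c + sqrt(beta) d) / (a + sqrt(beta) b)."""
    g = math.gamma
    s = math.sqrt(beta)
    top = g(lam + 1.0) + s * theta * (g(lam + 1.5) - g(lam + 0.5))
    bottom = g(lam) + s * theta * g(lam + 0.5)
    return top / (beta * bottom)


def expansion(beta, lam, theta):
    """Two leading terms: lam / beta + theta * identity_rhs(lam) / sqrt(beta)."""
    return lam / beta + theta * identity_rhs(lam) / math.sqrt(beta)
```

For a fixed $\Theta$, the difference `ratio(beta, lam, theta) - expansion(beta, lam, theta)`
stays bounded as `beta` decreases, which is the $O_p(1)$ remainder in the text.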

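By the definition of $\xi_n(u)$, $E[\Theta] = 0$, hence $E[U_n] = 0$. By using the Cauchy-Schwarz
inequality,

$$\Theta^2 \leq \frac{\int_{\mathcal{M}} du^*\,\xi_n^*(u)^2}{\int_{\mathcal{M}} du^*}.$$

Lastly let us study the case that $q(x)$ is realizable by $p(x|w)$. The support of $du^*$ is
contained in $u^{2k} = 0$, hence we can apply eq. (26) to $\Theta$,

$$E[\Theta^2] \leq \frac{\int_{\mathcal{M}} du^*\,E[\xi_n^*(u)^2]}{\int_{\mathcal{M}} du^*} = 2.$$

The gamma function satisfies

$$\frac{\Gamma(\lambda+1/2)}{\Gamma(\lambda)} < \sqrt{\lambda} \qquad (\lambda > 0).$$

Hence we obtain

$$E[(U_n)^2] = \frac{E[\Theta^2]}{2\lambda}\left(\frac{\Gamma(\lambda+1/2)}{\Gamma(\lambda)}\right)^2 < 1,$$

which completes Theorem 4. (Q.E.D.)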
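5.8 Proof of Corollary 1

By the definitions eq. (31) and eq. (32), it is sufficient to prove $\Theta = 0$, where

$$\Theta = \frac{\sum_{\alpha\in\mathcal{A}^*}\sum_{\sigma\in S(d)}\int b_\alpha(u)\,du^*\,\sigma^k\,\xi_n(u)}{\sum_{\alpha\in\mathcal{A}^*}\sum_{\sigma\in S(d)}\int b_\alpha(u)\,du^*}.$$

The support of the measure $du^*$ is contained in the set $\{u = (0, u_b)\}$. We use the notation
$\sigma = (\sigma_a, \sigma_b) \in \mathbb{R}^m \times \mathbb{R}^{d-m}$. If $Q(q,p,\varphi) = 1$ then there exists a resolution
map $w = g(u)$ such that $\sigma_a^k$ takes both values $+1$ and $-1$ in an arbitrary local coordinate, hence

$$\sum_{\sigma_a\in S(m)}\sigma_a^k = 0.$$

It follows that

$$\sum_{\sigma\in S(d)}\sigma^k\,\xi_n(0, u_b) = \sum_{\sigma_b\in S(d-m)}\sigma_b^k\,\xi_n(0, u_b)\sum_{\sigma_a\in S(m)}\sigma_a^k = 0,$$

therefore $\Theta = 0$, which completes Corollary 1. (Q.E.D.)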
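5.9 Proof of Corollary 2

By using the optimal inverse temperature $\beta^*$, we define $T = 1/(\beta^*\log n)$. By the definition,
$\mathcal{F} = E_w^{\beta^*}[nL_n(w)]$. By using Theorem 2 and Theorem 4,

$$\lambda\log n = T\lambda\log n + U_n\sqrt{T\lambda(\log n)/2} + O_p(\log\log n),$$

which is equivalent to

$$T + \frac{U_n\sqrt{T}}{\sqrt{2\lambda\log n}} - 1 + O_p\!\left(\frac{\log\log n}{\log n}\right) = 0.$$

Therefore,

$$\sqrt{T} = 1 - \frac{U_n}{\sqrt{8\lambda\log n}} + o_p\!\left(\frac{1}{\sqrt{\lambda\log n}}\right),$$

resulting in

$$\beta^*\log n = 1 + \frac{U_n}{\sqrt{2\lambda\log n}} + o_p\!\left(\frac{1}{\sqrt{\lambda\log n}}\right),$$

which completes Corollary 2. (Q.E.D.)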
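
As a small numerical illustration of this step, the quadratic equation for $\sqrt{T}$ can be solved
exactly and compared with the first-order formula for $\beta^*\log n$. This is a sketch under
stated assumptions: the $O_p(\log\log n/\log n)$ term is dropped, $U_n$ is treated as a given
number `u`, and the function names are hypothetical.

```python
import math


def beta_star_log_n_exact(u, lam, n):
    """Solve T + a*sqrt(T) - 1 = 0 with a = u / sqrt(2*lam*log n); return 1/T."""
    a = u / math.sqrt(2.0 * lam * math.log(n))
    sqrt_t = (-a + math.sqrt(a * a + 4.0)) / 2.0  # positive root of the quadratic
    return 1.0 / (sqrt_t * sqrt_t)


def beta_star_log_n_first_order(u, lam, n):
    """First-order formula 1 + u / sqrt(2*lam*log n) from Corollary 2."""
    return 1.0 + u / math.sqrt(2.0 * lam * math.log(n))
```

The two values differ by a term of order $U_n^2/(\lambda\log n)$, which is absorbed in the
$o_p(1/\sqrt{\lambda\log n})$ remainder.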

5.10 Proof of Corollary 3

By using Theorem 4,

$$E_w^{\beta_1}[nL_n(w)] = nL_n(w_0) + \frac{\lambda}{\beta_1} + O_p(\sqrt{\log n}),$$

$$E_w^{\beta_2}[nL_n(w)] = nL_n(w_0) + \frac{\lambda}{\beta_2} + O_p(\sqrt{\log n}).$$

Since $(1/\beta_1 - 1/\beta_2) = O_p(\log n)$,

$$\lambda = \frac{E_w^{\beta_1}[nL_n(w)] - E_w^{\beta_2}[nL_n(w)]}{1/\beta_1 - 1/\beta_2} + O_p(1/\sqrt{\log n}),$$

which shows Corollary 3. (Q.E.D.)

5.11 Proof of Theorem 5

By using eq. (12) and eq. (13), the proof of Theorem 5 results in evaluating $E_w^\beta[nK_n(w)]$. By
Lemma [J for the case r = 1/4, 



$$E_w^\beta[nK_n(w)] = \frac{D_n + o_p(\exp(-\sqrt{n}))}{C_n + o_p(\exp(-\sqrt{n}))}, \qquad (33)$$

where $C_n$ and $D_n$ are respectively defined by

$$C_n = \int_{K(w)<1/n^{1/4}}\exp(-n\beta K_n(w))\,\varphi(w)\,dw, \qquad (34)$$

$$D_n = \int_{K(w)<1/n^{1/4}} nK_n(w)\exp(-n\beta K_n(w))\,\varphi(w)\,dw. \qquad (35)$$



If a statistical model is regular, the maximum likelihood estimator $\hat{w}$ converges to $w_0$ in
probability. Let $J_n(w)$ be the $d \times d$ matrix defined by

$$(J_n)_{ij}(w) = \frac{\partial^2 K_n}{\partial w_i\,\partial w_j}(w).$$

There exists a parameter $w^*$ such that

$$K_n(w) = K_n(\hat{w}) + \frac{1}{2}(w - \hat{w})\cdot J_n(w^*)(w - \hat{w}).$$

Since $\hat{w} \to w_0$ in probability, $w^* \to w_0$ in probability. Then

$$\|J_n(w^*) - J(w_0)\| \leq \|J_n(w^*) - J_n(w_0)\| + \|J_n(w_0) - J(w_0)\|
\leq \|w^* - w_0\|\sup_{K(w)<1/n^{1/4}}\left\|\frac{\partial J_n(w)}{\partial w}\right\| + \|J_n(w_0) - J(w_0)\|,$$

which converges to zero in probability as $n \to \infty$. Therefore

$$J_n(w^*) = J(w_0) + o_p(1).$$

Since a statistical model is regular, $J(w_0)$ is a positive definite matrix.

$$C_n = \exp(-n\beta K_n(\hat{w}))\times\int_{K(w)<1/n^{1/4}}\exp\!\left(-\frac{n\beta}{2}(w - \hat{w})\cdot(J(w_0) + o_p(1))(w - \hat{w})\right)\varphi(w)\,dw.$$

By substituting

$$u = \sqrt{n\beta}\,(w - \hat{w}),$$

it follows that

$$C_n = \exp(-n\beta K_n(\hat{w}))\,(n\beta)^{-d/2}\int\exp\!\left(-\frac{1}{2}u\cdot(J(w_0) + o_p(1))u\right)\varphi\!\left(\hat{w} + \frac{u}{\sqrt{n\beta}}\right)du
= \frac{(2\pi)^{d/2}\exp(-n\beta K_n(\hat{w}))\,(\varphi(\hat{w}) + o_p(1))}{(n\beta)^{d/2}\det(J(w_0) + o_p(1))^{1/2}}.$$

H            1         2        3      4      5      6
WBIC1 Ave.   17899.8   3088.9   71.1   77.9   83.3   87.7
WBIC1 Std.   1081.3    227.0    3.7    4.0    4.0    4.2
WBIC2 Ave.   17828.7   3017.9   0      6.8    12.2   16.6
WBIC2 Std.   1081.2    226.7    0      1.8    2.3    2.3

Table 2: WBIC in Model Selection



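In the same way,

$$D_n = \exp(-n\beta K_n(\hat{w}))\times\int_{K(w)<1/n^{1/4}}\left(nK_n(\hat{w}) + \frac{n}{2}(w - \hat{w})\cdot(J(w_0) + o_p(1))(w - \hat{w})\right)
\times\exp\!\left(-\frac{n\beta}{2}(w - \hat{w})\cdot(J(w_0) + o_p(1))(w - \hat{w})\right)\varphi(w)\,dw$$

$$\phantom{D_n} = \frac{(2\pi)^{d/2}\exp(-n\beta K_n(\hat{w}))\,(\varphi(\hat{w}) + o_p(1))}{(n\beta)^{d/2}\det(J(w_0) + o_p(1))^{1/2}}\left(nK_n(\hat{w}) + \frac{d}{2\beta} + o_p(1)\right).$$

Here $nK_n(\hat{w}) = O_p(1)$, because a true distribution is regular for a statistical model.
Therefore,

$$E_w^\beta[nL_n(w)] = nL_n(w_0) + nK_n(\hat{w}) + \frac{d}{2\beta} + o_p(1),$$

which completes Theorem 5. (Q.E.D.)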



6 A Method How to Use WBIC 

In this section we show how to use WBIC in statistical model evaluation. The
main theorems have already been mathematically proved, hence WBIC has theoretical
support. The following experiment was conducted not to prove theorems but to
illustrate how to use WBIC.



6.1 Statistical Model Selection 

Firstly, we study model selection by using WBIC. 

Let $x \in \mathbb{R}^M$, $y \in \mathbb{R}^N$, and $w = (A, B)$, where $A$ is an $H \times M$ matrix and $B$ is an
$N \times H$ matrix. A reduced rank regression model is defined by

$$p(x, y|w) = \frac{r(x)}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}\|y - BAx\|^2\right),$$

where $r(x)$ is a probability density function of $x$ and $\sigma^2$ is the variance of an output. Let
$\mathcal{N}_m(0, \Sigma)$ denote the $m$-dimensional normal distribution with average zero and
covariance matrix $\Sigma$.

In the experiment, we set $\sigma = 0.1$, $r(x) = \mathcal{N}_M(0, 3^2 I)$, where $I$ is the identity matrix,
and $\varphi(w) = \mathcal{N}_d(0, 10^2 I)$. The true distribution was fixed as $p(x, y|w_0)$, where $w_0 = (A_0, B_0)$
was determined so that $A_0$ and $B_0$ were respectively an $H_0 \times M$ matrix and an $N \times H_0$
matrix. Note that, in reduced rank regression models, RLCTs and multiplicities were
clarified by Aoyagi and Watanabe (2005) and $Q(K(w), \varphi(w)) = 1$ for arbitrary $q(x)$, $p(x|w)$
and $\varphi(w)$. In the experiment, $M = N = 6$ and the true rank was set as $H_0 = 3$. Each
element of $A_0$ and $B_0$ was taken from $\mathcal{N}_1(0, 0.2^2)$ and fixed. From the true distribution
$p(x, y|w_0)$, 100 sets of $n = 500$ training samples were generated.

The Metropolis method was employed for sampling from the posterior distribution,

$$p(w|X_1, X_2, \ldots, X_n) \propto \exp(-\beta nL_n(w) + \log\varphi(w)),$$

where $\beta = 1/\log n$. Every Metropolis trial was generated from a normal distribution
$\mathcal{N}_d(0, (0.0012)^2 I)$, by which the acceptance probability was 0.3-0.5. The first 50000
Metropolis trials were not used. After these 50000 trials, $R = 2000$ parameters
$\{w_r\,;\ r = 1, 2, \ldots, R\}$ were obtained, one in every 100 Metropolis steps. The expectation
value of a function $G(w)$ over the posterior distribution was approximated by

$$E_w^\beta[G(w)] \approx \frac{1}{R}\sum_{r=1}^{R} G(w_r).$$

The six statistical models $H = 1, 2, 3, 4, 5, 6$ were compared by the criterion

$$\mathrm{WBIC} = E_w^\beta[nL_n(w)], \qquad (\beta = 1/\log n).$$

To compare these values among several models, we show both WBIC1 and WBIC2 in
Table 2. In the table, the average and the standard deviation of WBIC1 defined by

$$\mathrm{WBIC}_1 = \mathrm{WBIC} - nS_n$$

for 100 independent sets of training samples are shown, where the empirical entropy of
the true distribution

$$S_n = -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i)$$

does not depend on a statistical model. Also WBIC2 in Table 2 shows the average and
the standard deviation of

$$\mathrm{WBIC}_2 = \mathrm{WBIC} - \mathrm{WBIC}(3),$$

where WBIC(3) is the WBIC for $H = H_0 = 3$. In 100 independent sets of training samples,
the true model $H = 3$ was chosen 100 times in this experiment, which demonstrates a
typical way of applying WBIC.
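
The procedure above (Metropolis sampling at $\beta = 1/\log n$ followed by averaging $nL_n(w)$) can
be sketched in a few lines of Python. The sketch below is not the reduced rank regression
experiment of this section: as an assumption made only for illustration, it uses a one-parameter
normal model $\mathcal{N}(w, 1)$ with prior $\mathcal{N}(0, 10^2)$, and all function names, step sizes and
chain lengths are hypothetical choices.

```python
import math
import random


def n_times_loss(w, n, sum_x, sum_xx):
    """n L_n(w) = -sum_i log p(x_i | w) for the toy model N(w, 1)."""
    quad = sum_xx - 2.0 * w * sum_x + n * w * w
    return 0.5 * quad + 0.5 * n * math.log(2.0 * math.pi)


def wbic_metropolis(xs, step=0.3, burn_in=2000, n_keep=2000, thin=10,
                    prior_sd=10.0, seed=0):
    """Average of n L_n(w) over Metropolis draws taken at beta = 1 / log n."""
    rng = random.Random(seed)
    n = len(xs)
    beta = 1.0 / math.log(n)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)

    def log_target(w):
        # log of exp(-beta n L_n(w)) times the N(0, prior_sd^2) prior, up to a constant
        return -beta * n_times_loss(w, n, sum_x, sum_xx) - 0.5 * (w / prior_sd) ** 2

    w = sum_x / n                      # start at the maximum likelihood estimate
    current = log_target(w)
    kept = []
    for t in range(burn_in + n_keep * thin):
        proposal = w + rng.gauss(0.0, step)
        candidate = log_target(proposal)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            w, current = proposal, candidate
        if t >= burn_in and (t - burn_in + 1) % thin == 0:
            kept.append(n_times_loss(w, n, sum_x, sum_xx))
    return sum(kept) / len(kept)
```

Because this toy model is regular with $d = 1$, the returned value should be close to
$nL_n(\hat{w}) + (1/2)\log n$, in line with Theorem 5; for singular models no such closed form
is available and the sampled average itself is the criterion.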



6.2 Estimating RLCT 

Secondly, we study how to estimate an RLCT. By using the same experiment
as in the foregoing subsection, we estimated the RLCTs of reduced rank regression models by
using Corollary 3. Based on eq. (22), the estimated RLCT is given by

$$\hat{\lambda} = \frac{E_w^{\beta_1}[nL_n(w)] - E_w^{\beta_2}[nL_n(w)]}{1/\beta_1 - 1/\beta_2},$$

where $\beta_1 = 1/\log n$ and $\beta_2 = 1.5/\log n$ were used and

$$E_w^{\beta_2}[nL_n(w)] = \frac{E_w^{\beta_1}[nL_n(w)\exp(-(\beta_2 - \beta_1)nL_n(w))]}{E_w^{\beta_1}[\exp(-(\beta_2 - \beta_1)nL_n(w))]}.$$

H             1      2      3       4       5       6
Theory λ      5.5    10     13.5    15      16      17
Theory m      1      1      1       2       1       2
Average λ     5.50   9.93   13.44   14.69   15.74   16.53
Std. Dev. λ   0.19   0.32   0.47    0.60    0.66    0.88

Table 3: RLCTs for the case $H_0 = 3$



"Theory λ" in Table 3 shows the theoretical values of RLCTs of reduced rank regression.
For the cases when true distributions are unrealizable by statistical models, RLCTs are
given by half the dimension of the parameter space, $\lambda = H(M + N - H)/2$. In Table 3,
the averages and standard deviations of $\hat{\lambda}$ show the estimated RLCTs. The theoretical RLCTs
were well estimated. The difference between theory and experimental results was caused
by the effect of terms of smaller order than $\log n$. In the case of multiplicity $m = 2$, the
term $\log\log n$ also affected the results.



7 Discussion 



In this section, we discuss the widely applicable information criterion from three different 
points of view. 

7.1 WAIC and WBIC 

Firstly, let us study the difference between the free energy and the generalization error.
In the present paper, we study the Bayes free energy $\mathcal{F}$ as the statistical model selection
criterion. Its expectation value is given by

$$E[\mathcal{F}] = nS + \int q(x^n)\log\frac{q(x^n)}{p(x^n)}\,dx^n,$$

where $S$ is the entropy of the true distribution,

$$q(x^n) = \prod_{i=1}^{n} q(x_i),$$

$$p(x^n) = \int\prod_{i=1}^{n} p(x_i|w)\,\varphi(w)\,dw,$$

and $dx^n = dx_1\,dx_2\cdots dx_n$. Hence minimization of $E[\mathcal{F}]$ is equivalent to minimization of
the Kullback-Leibler distance from $q(x^n)$ to $p(x^n)$.

There is a different model evaluation criterion, which is the generalization loss defined
by

$$\mathcal{G} = -\int q(x)\log p^*(x)\,dx, \qquad (36)$$

where $p^*(x)$ is the Bayes predictive distribution defined by $p^*(x) = E_w^\beta[p(x|w)]$ with $\beta = 1$.
The expectation value of $\mathcal{G}$ satisfies

$$E[\mathcal{G}] = S + E\left[\int q(x)\log\frac{q(x)}{p^*(x)}\,dx\right].$$

Hence minimization of $E[\mathcal{G}]$ is equivalent to minimization of the Kullback-Leibler distance
from $q(x)$ to $p^*(x)$. Both $\mathcal{F}$ and $\mathcal{G}$ are important in statistics and learning theory;
however, they are different criteria.

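The well-known model selection criteria AIC and BIC are respectively defined by

$$\mathrm{AIC} = L_n(\hat{w}) + \frac{d}{n}, \qquad (37)$$

$$\mathrm{BIC} = nL_n(\hat{w}) + \frac{d}{2}\log n. \qquad (38)$$

If a true distribution is realizable by and regular for a statistical model, then

$$E[\mathrm{AIC}] = E[\mathcal{G}] + o\!\left(\frac{1}{n}\right),$$

$$E[\mathrm{BIC}] = E[\mathcal{F}] + O(1).$$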



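These relations can be generalized to singular statistical models. We define WAIC and
WBIC by

$$\mathrm{WAIC} = T_n + V_n/n,$$

$$\mathrm{WBIC} = E_w^\beta[nL_n(w)], \qquad \beta = 1/\log n,$$

where

$$T_n = -\frac{1}{n}\sum_{i=1}^{n}\log p^*(X_i),$$

$$V_n = \sum_{i=1}^{n}\left\{E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2\right\}.$$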
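
For readers who want to compute WAIC from posterior draws, the definitions of $T_n$ and $V_n$
translate directly into code. The following Python sketch assumes that a table
`log_lik[r][i]` of values $\log p(X_i|w_r)$ is already available for draws $w_r$ from the
posterior at $\beta = 1$; the function name and the input layout are hypothetical.

```python
import math


def waic(log_lik):
    """WAIC = T_n + V_n / n from log_lik[r][i] = log p(X_i | w_r)."""
    n_draws = len(log_lik)
    n = len(log_lik[0])
    t_n = 0.0
    v_n = 0.0
    for i in range(n):
        column = [log_lik[r][i] for r in range(n_draws)]
        # log of the predictive density p*(X_i), computed stably
        shift = max(column)
        mean_prob = sum(math.exp(v - shift) for v in column) / n_draws
        t_n -= (shift + math.log(mean_prob)) / n
        # posterior variance of log p(X_i | w)
        mean = sum(column) / n_draws
        mean_sq = sum(v * v for v in column) / n_draws
        v_n += mean_sq - mean * mean
    return t_n + v_n / n
```

With a single observation and two draws giving likelihoods 0.5 and 0.25, the predictive
density is 0.375 and the function returns $-\log 0.375 + ((\log 2)/2)^2$.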



Then, even if a true distribution is unrealizable by and singular for a statistical model,

$$E[\mathrm{WAIC}] = E[\mathcal{G}] + O\!\left(\frac{1}{n^2}\right), \qquad (39)$$

$$E[\mathrm{WBIC}] = E[\mathcal{F}] + O(\log\log n), \qquad (40)$$

where eq. (39) was proved in (Watanabe, 2009, 2010a), whereas eq. (40) has been proved
in the present paper. Moreover, if a true distribution is realizable by and regular for a
statistical model, WAIC and WBIC respectively coincide with AIC and BIC,

$$\mathrm{WAIC} = \mathrm{AIC} + o_p\!\left(\frac{1}{n}\right),$$

$$\mathrm{WBIC} = \mathrm{BIC} + o_p(1).$$

A theoretical comparison of WAIC and WBIC in singular model selection is an important
problem for future study.



Remark. If a prior distribution is positive at the optimal set of parameters, then RLCTs
are smaller than $d/2$ in singular models, so that both WAIC and WBIC in singular
models are respectively smaller than AIC and BIC. Therefore, if Bayes estimation is
applied to singular models, a larger model can be employed with a smaller generalization
error. If a true model is unrealizable by any finite size model, this is a good property from
the viewpoint of the best balance of bias and variance; however, this fact simultaneously
means weaker consistency in model selection. If a true model is realizable by some
finite size model, and if the main purpose of statistical model evaluation is to find the
true model, Jeffreys' prior is recommended. Note that Jeffreys' prior is equal to zero
at singularities and $\lambda \geq d/2$ holds (Watanabe, 2009). However, Jeffreys' prior is not
appropriate for the case when a true distribution is unrealizable by a finite model. It is well
known in statistics and learning theory that consistency in model selection is different
from minimization of the generalization error.

7.2 Other Methods How to Evaluate Free Energy 

Secondly, we discuss several methods for numerically evaluating the Bayes free energy.
There are three methods other than WBIC.

Firstly, let $\{\beta_j\,;\ j = 0, 1, 2, \ldots, J\}$ be a sequence which satisfies

$$0 = \beta_0 < \beta_1 < \cdots < \beta_J = 1.$$

Then the Bayes free energy satisfies

$$\mathcal{F} = -\sum_{j=1}^{J}\log E_w^{\beta_{j-1}}\left[\exp(-n(\beta_j - \beta_{j-1})L_n(w))\right].$$

This method can be used without asymptotic theory. We can estimate $\mathcal{F}$ if the number $J$ is
sufficiently large and if all expectation values over the posterior distributions $\{E_w^{\beta_{j-1}}[\ ]\}$ are
precisely calculated. The disadvantage of this method is its huge computational cost for
accurate calculation. In the present paper, this method is referred to as the 'all temperatures
method'.
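
The telescoping formula of the all temperatures method can be illustrated on a model where
every tempered posterior can be sampled exactly. The Python sketch below makes that
simplifying assumption: a normal model $\mathcal{N}(w, 1)$ with prior $\mathcal{N}(0, s^2)$, for which the
posterior at inverse temperature $\beta$ is again normal and the free energy is known in closed
form. The function names, the temperature schedule $\beta_j = (j/J)^4$ and the sample sizes are
hypothetical choices, not taken from the paper.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def free_energy_exact(xs, prior_sd=10.0):
    """Closed-form -log of the marginal likelihood for N(w, 1) with prior N(0, s^2)."""
    n, sx, sxx = len(xs), sum(xs), sum(x * x for x in xs)
    a = n + 1.0 / prior_sd ** 2
    return (0.5 * n * math.log(2.0 * math.pi) + 0.5 * sxx + math.log(prior_sd)
            + 0.5 * math.log(a) - sx * sx / (2.0 * a))


def free_energy_all_temperatures(xs, prior_sd=10.0, n_temps=40, n_draws=2000, seed=0):
    """F = -sum_j log E^{beta_{j-1}}[exp(-n (beta_j - beta_{j-1}) L_n(w))]."""
    rng = random.Random(seed)
    n, sx, sxx = len(xs), sum(xs), sum(x * x for x in xs)

    def n_loss(w):
        return 0.5 * (sxx - 2.0 * w * sx + n * w * w) + 0.5 * n * math.log(2.0 * math.pi)

    betas = [(j / n_temps) ** 4 for j in range(n_temps + 1)]  # dense near beta = 0
    total = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        precision = n * b_prev + 1.0 / prior_sd ** 2           # tempered posterior is normal
        mean, sd = b_prev * sx / precision, 1.0 / math.sqrt(precision)
        terms = [-(b_next - b_prev) * n_loss(rng.gauss(mean, sd)) for _ in range(n_draws)]
        total -= log_mean_exp(terms)
    return total
```

In realistic singular models each expectation requires its own MCMC run, which is the source
of the large computational cost mentioned above.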

Secondly, the importance sampling method is often used. Let $H(w)$ be a function which
approximates $nL_n(w)$. Then, for an arbitrary function $G(w)$, we define an expectation
value $\hat{E}_w[\ ]$ by

$$\hat{E}_w[G(w)] = \frac{\int G(w)\exp(-H(w))\,\varphi(w)\,dw}{\int\exp(-H(w))\,\varphi(w)\,dw}.$$

Then

$$\mathcal{F} = -\log\hat{E}_w[\exp(-nL_n(w) + H(w))] - \log\int\exp(-H(w))\,\varphi(w)\,dw,$$

where the last term is the free energy of $H(w)$. Hence if we find $H(w)$ whose free energy
can be analytically calculated and if it is easy to generate random samples from $\hat{E}_w[\ ]$, then
$\mathcal{F}$ can be numerically evaluated. The accuracy of this method strongly depends on the
choice of $H(w)$.
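
The importance sampling formula can be illustrated in the same spirit. The Python sketch below
again assumes, purely for illustration, a normal model $\mathcal{N}(w, 1)$ with prior $\mathcal{N}(0, s^2)$,
and takes $H(w)$ to be a quadratic with a deliberately mismatched curvature so that the
correction term is not trivial. The function names and the factor 0.7 are hypothetical.

```python
import math
import random


def free_energy_exact(xs, prior_sd=10.0):
    """Closed-form -log of the marginal likelihood for N(w, 1) with prior N(0, s^2)."""
    n, sx, sxx = len(xs), sum(xs), sum(x * x for x in xs)
    a = n + 1.0 / prior_sd ** 2
    return (0.5 * n * math.log(2.0 * math.pi) + 0.5 * sxx + math.log(prior_sd)
            + 0.5 * math.log(a) - sx * sx / (2.0 * a))


def free_energy_importance(xs, prior_sd=10.0, curvature_factor=0.7, n_draws=5000, seed=0):
    """F = -log E_H[exp(-n L_n(w) + H(w))] + (free energy of H)."""
    rng = random.Random(seed)
    n, sx, sxx = len(xs), sum(xs), sum(x * x for x in xs)
    x_bar = sx / n
    loss_min = 0.5 * (sxx - n * x_bar * x_bar) + 0.5 * n * math.log(2.0 * math.pi)
    a = curvature_factor * n          # H(w) = n L_n(x_bar) + (a / 2) (w - x_bar)^2
    big_a = a + 1.0 / prior_sd ** 2   # precision of exp(-H(w)) * prior

    # Free energy of H, available analytically because exp(-H) * prior is Gaussian.
    free_energy_h = (loss_min + math.log(prior_sd) + 0.5 * math.log(big_a)
                     + a * x_bar * x_bar / (2.0 * prior_sd ** 2 * big_a))

    # Draws from the normalized exp(-H) * prior, then the correction term.
    mean, sd = a * x_bar / big_a, 1.0 / math.sqrt(big_a)
    terms = []
    for _ in range(n_draws):
        w = rng.gauss(mean, sd)
        terms.append(-0.5 * (n - a) * (w - x_bar) ** 2)   # -n L_n(w) + H(w)
    m = max(terms)
    log_mean = m + math.log(sum(math.exp(t - m) for t in terms) / n_draws)
    return -log_mean + free_energy_h
```

Choosing `curvature_factor=1.0` makes $H(w) = nL_n(w)$ in this toy model, in which case the
correction term vanishes and the estimate has no Monte Carlo error; a poorer $H(w)$ increases
the variance, as stated in the text.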



Thirdly, a two-step method was proposed by Drton (2010). Assume that we have
theoretical values of RLCTs for all combinations of true distributions and statistical models.
Then, in the first step, a null hypothesis model is chosen by using BIC. In the second step,
the optimal model is chosen by using RLCTs with the assumption that the null hypothesis
model is a true distribution. If the selected model is different from the null hypothesis
model, then the same procedure is recursively applied until the null hypothesis model
becomes the optimal model. In this method, asymptotic theory is necessary but RLCTs
do not contain fluctuations because they are theoretical values.

Compared with these three methods, WBIC needs asymptotic theory but not theoretical
values of RLCTs. Moreover, WBIC can be used even if a true distribution is unrealizable
by a statistical model. The theoretical comparison of these four methods is shown in Table 4.

Method                Asymptotics   RLCT       Comput. Cost
All Temperatures      Not used      Not used   Huge
Importance Sampling   Not used      Not used   Small
Two-Step              Used          Used       Small
WBIC                  Used          Not used   Small

Table 4: Comparison of Several Methods

The effectiveness of a model selection method strongly depends on a statistical condition
which is determined by a true distribution, a statistical model, a prior distribution,
and a set of training samples. Under some conditions one method may be more effective,
whereas under other conditions another may be. The proposed method WBIC gives a
new approach to the numerical calculation of the Bayes free energy, which is more useful in
cooperation with the conventional methods. It is a topic for future study to clarify which
method is recommended under which statistical conditions.



7.3 Algebraic geometry and Statistics 

Lastly, let us discuss a relation between algebraic geometry and statistics. In the present
paper, we defined the parity of a statistical model $Q(K(w), \varphi(w))$ and proved that it affects the
asymptotic behavior of WBIC. In this subsection we show three mathematical properties
of the parity of a statistical model.

Firstly, the parity has a relation to the analytic continuation of $K(w)^{1/2}$. For example,
by using the blow-up $(a, b) = (a_1, a_1 b_1) = (a_2 b_2, b_2)$, it follows that the analytic continuation of
$(a^2 + b^2)^{1/2}$ is given by

$$(a^2 + b^2)^{1/2} = a_1\sqrt{1 + b_1^2} = b_2\sqrt{a_2^2 + 1},$$

which takes both positive and negative values. On the other hand, $(a^4 + b^4)^{1/2}$ takes only
nonnegative values. The parity indicates such a difference.

Secondly, the parity has a relation to a statistical model with a restricted parameter set.
For example, a statistical model

$$p(x|a) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(x - a)^2}{2}\right),$$

whose parameter set is given by $\{a \geq 0\}$, is equivalent to a statistical model $p(x|b^2)$
with $\{b \in \mathbb{R}\}$. In other words, a statistical model which has a restricted parameter set is
statistically equivalent to another even model which has an unrestricted parameter set. We
have a conjecture that an even statistical model has some relation to a model with a
restricted parameter set.






And lastly, the parity has a relation to the difference of $K(w)$ and $K_n(w)$. As is proven
in (Watanabe, 2001a), the relation

$$-\log\int\exp(-nK_n(w))\,\varphi(w)\,dw = -\log\int\exp(-nK(w))\,\varphi(w)\,dw + O_p(1)$$

holds independent of the parity of a statistical model. On the other hand, if $\beta = 1/\log n$,
then

$$\frac{\int nK_n(w)\exp(-n\beta K_n(w))\,\varphi(w)\,dw}{\int\exp(-n\beta K_n(w))\,\varphi(w)\,dw}
= \frac{\int nK(w)\exp(-n\beta K(w))\,\varphi(w)\,dw}{\int\exp(-n\beta K(w))\,\varphi(w)\,dw} + U_n\sqrt{\log n} + O_p(1).$$

If the parity is odd, then $U_n = 0$; otherwise $U_n$ is not equal to zero in general. This fact
shows that the parity determines a difference in the fluctuation of the likelihood function.



8 Conclusion 

We proposed a widely applicable Bayesian information criterion (WBIC) which can be
used even if a true distribution is unrealizable by and singular for a statistical model, and
proved that WBIC has the same asymptotic expansion as the Bayes free energy. We also
developed a method for estimating real log canonical thresholds even if a true distribution
is unknown.



Acknowledgement 

This research was partially supported by the Ministry of Education, Science, Sports and
Culture in Japan, Grant-in-Aid for Scientific Research 23500172.



References 

Miki Aoyagi and Kenji Nagata. Learning coefficient of generalization error in Bayesian
estimation and Vandermonde matrix-type singularity. Neural Computation, 24(6):1569-1610,
2012.

Miki Aoyagi and Sumio Watanabe. Stochastic complexities of reduced rank regression in
Bayesian estimation. Neural Networks, 18(7):924-933, 2005.

Michael Francis Atiyah. Resolution of singularities and division of distributions.
Communications on Pure and Applied Mathematics, 13:145-150, 1970.

Iosif Naumovic Joseph Bernstein. Analytic continuation of distributions with respect to a
parameter. Functional Analysis and its Applications, 6(4):26-40, 1972.

Mathias Drton. Reduced rank regression. In Workshop on Singular Learning Theory.
American Institute of Mathematics, 2010.

Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on Algebraic Statistics.
Birkhäuser, Berlin, 2009.






Israel Moiseevich Gelfand and Georgi Evgen'evich Shilov. Generalized Functions. Volume
I: Properties and Operations. Academic Press, San Diego, 1964.

Irving John Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods.
MIT Press, Cambridge, 1965.

Heisuke Hironaka. Resolution of singularities of an algebraic variety over a field of
characteristic zero. Annals of Mathematics, 79:109-326, 1964.

Masaki Kashiwara. B-functions and holonomic systems. Inventiones Mathematicae,
38:33-53, 1976.

Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A. J. Blythe, and
Klaus-Robert Müller. Algebraic geometric comparison of probability distributions. Journal of
Machine Learning Research, 13:855-903, 2012.

János Kollár. Singularities of pairs. In Algebraic Geometry, Santa Cruz 1995, Proceedings
of Symposia in Pure Mathematics, volume 62, pages 221-286. American Mathematical
Society, 1997.

Shaowei Lin. Algebraic Methods for Evaluating Integrals in Bayesian Statistics. PhD
thesis, University of California, Berkeley, 2011.

Dmitry Rusakov and Dan Geiger. Asymptotic model selection for naive Bayesian network.
Journal of Machine Learning Research, pages 1-35, 2005.

Morihiko Saito. On real log canonical thresholds. arXiv:0707.2308v1, 2007.

Mikio Sato and Takuro Shintani. On zeta functions associated with prehomogeneous
vector space. Annals of Mathematics, 100:131-170, 1974.

Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464,
1978.

Alexander Varchenko. Newton polyhedrons and estimates of oscillatory integrals.
Functional Analysis and its Applications, 10(3):13-38, 1976.

Sumio Watanabe. Algebraic analysis for singular statistical estimation. Lecture Notes in
Computer Sciences, 1720:39-50, 1999.

Sumio Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural
Computation, 13(4):899-933, 2001a.

Sumio Watanabe. Algebraic geometrical methods for hierarchical learning machines.
Neural Networks, 14(8):1049-1060, 2001b.

Sumio Watanabe. Algebraic geometry and statistical learning theory. Cambridge University
Press, Cambridge, UK, 2009.

Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable
information criterion in singular learning theory. Journal of Machine Learning Research,
11:3571-3591, 2010a.






Sumio Watanabe. Asymptotic learning curve and renormalizable condition in statistical
learning theory. Journal of Physics Conference Series, 233(1):012014, 2010b. doi:
10.1088/1742-6596/233/1/012014.

Keisuke Yamazaki and Sumio Watanabe. Singularities in mixture models and upper
bounds of stochastic complexity. Neural Networks, 16(7):1029-1038, 2003.

Keisuke Yamazaki and Sumio Watanabe. Singularities in complete bipartite graph-type
Boltzmann machines and upper bounds of stochastic complexities. IEEE Transactions
on Neural Networks, 16(2):312-324, 2005.

Piotr Zwiernik. Asymptotic model selection and identifiability of directed tree models
with hidden variables. CRiSM report, 2010.

Piotr Zwiernik. An asymptotic behaviour of the marginal likelihood for general Markov
models. Journal of Machine Learning Research, 12:3283-3310, 2011.






